Scientists rise up against statistical significance (nature.com)
618 points by bookofjoe on March 20, 2019 | 249 comments

Maybe I'm just being jaded, and I'm certainly not a researcher or statistician, but I don't see how removing "statistical significance" from scientific nomenclature is going to prevent lazy readers (or science reporters) from trying to distill a "yes/no" or "proven/unproven" answer from P values listed in a complex research paper.

Well, that's why the article doesn't propose simply ditching P-values; it proposes reporting confidence intervals instead. Not only do they provide more information (by simultaneously conveying both statistical and practical significance), they're also easier to interpret correctly without special training.

> they're also easier to interpret correctly without special training.

Heh, you'd be surprised! Most people I met would interpret a 95% CI by saying that there is a 95% chance that it contains the true mean.

Is this not the definition of "confidence interval"? The first few Google results all define it this way...

Unless I've forgotten more than I hope, I believe the formal definition of a 95% confidence interval is that "if the model is true, 95% of the experiments would result in a point estimate within the interval." This is distinctly different from "a 95% probability that the true value is contained within the confidence interval", but that is typically what is loosely inferred.

It is distinctly different, but I have to think really hard to follow the practical difference between "If we ran 100 more identical experiments, 95 of them would estimate a population mean result somewhere in this range", and "There's a 95% probability that the population mean falls somewhere in this range."

The thing that makes it hurt is that what seems like a minor rewording of that statement, "We're 95% confident that the range we calculated contains the true population mean," would be correct. (It's not the definition of a CI, but it is implied by the definition of a CI.) The reason why one is true and the other is false comes down to the technical distinction that the true population mean is a fixed value and not a draw from a random variable, so, in a strictly mathematical sense, it is nonsensical to apply probabilities to it. By the same token, an even more subtle rewording gets us back to falsehood: "There's a 95% chance that the range we calculated contains the true population mean."

Delving into that level of hair-splitting makes for interesting math, but it also leaves me with the opinion that, sure, the most common intuitive interpretation is wrong, but it's wrong in a way that isn't really of much practical importance.

By contrast, the most common intuitive interpretation of the p-value is disastrously wrong.

Nope. It is (from a frequentist's model of statistics), exactly what the article is claiming it isn't: If the model is true, and we repeat the experiment several times, 95% of the intervals we calculate will contain the true value. The actual CI you get in each experiment will differ.

Another discrepancy between frequentist statistics and the article is that yes, the values at the boundary of your interval are as credible as in the center.
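
The repeated-experiment definition above can be checked with a small simulation. This is only a sketch under stated assumptions: the true mean, sigma, sample size, and number of trials are made-up values, and it uses a known-sigma z-interval for simplicity rather than whatever interval a given study would use.

```python
import random
from statistics import NormalDist, mean

def coverage(true_mean=10.0, sigma=2.0, n=30, trials=2000, level=0.95, seed=0):
    """Fraction of simulated experiments whose CI contains true_mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    half_width = z * sigma / n ** 0.5          # known-sigma z-interval (assumption)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        center = mean(sample)                  # the interval moves with each sample
        if center - half_width <= true_mean <= center + half_width:
            hits += 1
    return hits / trials

print(coverage())  # fraction of intervals that covered the true mean
```

The printed fraction should land close to 0.95. Each individual interval either covers the true mean or it does not; the 95% describes the procedure across repetitions.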

This is the first post I've seen that states the definition of a CI correctly.

Another note: 'confidence interval' typically refers to the frequentist meaning, whereas 'credibility interval' is used in the Bayesian setting, when describing an interval of the posterior with 95% probability (which is arguably more interpretable). The usages of the two terms do not seem to generally be strict, however.

> the values at the boundary of your interval are as credible as in the center.

What would that mean in the frequentist framework?

To be explicit, and using an example similar to the one in the article, if your CI is (2, 40), with the center being 21, there is no reason to believe that the true value is closer to 21 than to, say, 3.

To provide an extreme case, during the Iraq war, epidemiologists did a survey and came up with an estimated number of deaths. The point value was 100K, and that's what all the newspapers ran with. But the actual journal paper had a CI of (8K, 194K). There's no reason to believe the true value is closer to 100K than it is to 10K. Or to 190K.

You're right, from the frequentist definition of a confidence interval (2,40) we can't say that the true value is more likely to be closer to 21 than to 3.

But neither can we say that the true value is just as likely to be close to 21 as to 3.

The point is that, from the frequentist definition of a confidence interval, there is nothing at all that we can say about how likely the true value is to be here or there.

It could be 3, 21, or 666 and there is nothing that can be said about the likelihood of each value (unless we go beyond the frequentist framework and introduce prior probabilities).

>The point is that, from the frequentist definition of a confidence interval, there is nothing at all that we can say about how likely the true value is to be here or there.

Yes - sorry if I wasn't clear. I did not mean to imply that each value in the interval is equally likely (and looking over my comments, I do not think I did imply that).

The complaint is that the article is stating otherwise as fact.

>One practical way to do so is to rename confidence intervals as ‘compatibility intervals’

>The point estimate is the most compatible, and values near it are more compatible than those near the limits.

They simply are not in a frequentist model (which is the model most social scientists use). I agree with the main thrust of the article in that there are many problems with P values. But I am surprised that a journal like Nature is allowing clearly problematic statements like these.

I don't know enough about the Bayesian world to be able to state if his statement is wrong there as well, but if it is correct there, it is problematic that the authors did not state clearly that they are referring to the Bayesian model and not the frequentist one.

(Not to get into a B vs F war here, but I remember a nice joke amongst statisticians. There are 2 types of statisticians: Those who practice Bayesian statistics, and those who practice both).

> I did not mean to imply that each value in the interval is equally likely (and looking over my comments, I do not think I did imply that).

When you said that "the values at the boundary of your interval are as credible as in the center" you kind of implied that, which is why I asked.

I won't defend the article being discussed, but you opposed their statement that "the values in the center are more compatible than the values at the boundary" with an equally ill-defined "the values at the boundary are as credible as in the center".

I do not read my statement to imply uniform distribution.

What I meant was "there is no reason to prefer values at the center more than values at the boundary" based on the CI (there may be external reasons, though). To me, this is equivalent to your:

>there is nothing that can be said about the likelihood of each value

Ok, we agree. "As credible as" implies uniform "credibility". "More compatible than" implies non-uniform "compatibility". Without any clear definition of "credibility" or "compatibility" it's impossible to interpret precisely what those claims are supposed to mean.

> there is no reason to believe that the true value is closer to 21 than to, say, 3.

I find this very silly. Suppose we ditch the arbitrary 0.95 and go with a 0.999... confidence interval of, for example, [-998, 1040]. How can one say that one cannot tell which value is more likely, 21 or 1040?

If this is an actual limitation of the frequentist model like you said, everybody should be a Bayesian thinker then. And the "confidence interval" is just a quick way to communicate how wide and where the posterior bell curve is.

Confidence intervals can be applied to point estimates, estimates of means, and estimates of other things, including higher order moments.

The difference is hugely important: the central limit theorem and the implication of a normal distribution commonly apply to sample means.

You can calculate confidence intervals for most (not all) other statistics, like point estimates, but the distributions might not be normal.

"for most (not all)" - yes, if analytically. If you can afford bootstrapping, then it is just "for all".

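For readers who haven't seen it, a percentile bootstrap interval can be sketched in a few lines. This is one of several bootstrap variants, not necessarily the one the commenter had in mind; the skewed sample data, the choice of the median as the statistic, and the resample count are all made up for illustration.

```python
import random
from statistics import median

def bootstrap_ci(data, stat=median, level=0.95, resamples=2000, seed=0):
    """Percentile bootstrap interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Recompute the statistic on many resamples drawn with replacement.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(resamples))
    alpha = 1 - level
    lo_idx = int(alpha / 2 * resamples)
    hi_idx = int((1 - alpha / 2) * resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]

data_rng = random.Random(1)
data = [data_rng.expovariate(1.0) for _ in range(50)]  # hypothetical skewed sample
lo, hi = bootstrap_ci(data)
print(lo, median(data), hi)
```

No normality assumption is needed for the statistic, which is the appeal; the cost is the repeated resampling.
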
Technically no for the standard frequentist confidence intervals, but if they use the Bayesian Credible Interval then I believe that would be the correct interpretation.

This is one of many Bayesian vs. frequentist blog posts where the frequentist example is presented in such bad faith or is so wrong that it's impossible to take seriously. Why is the sample mean used for the frequentist CI when it is not a sufficient statistic, and especially since it appears after the section discussing a "common sense approach" in which the author does mention a sufficient statistic: min(D)? All this blog post shows is that reasonable Bayesian approaches are better than frequentist approaches where common sense isn't allowed.

Unless I'm missing something the author answers that here:

> Edit, November 2014: ... Had we used, say, the Maximum Likelihood estimator or a sufficient estimator like min(x), our initial misinterpretation of the confidence interval would not have been as obviously wrong, and may even have fooled us into thinking we were right. But this does not change our central argument, which involves the question frequentism asks. Regardless of the estimator, if we try to use frequentism to ask about parameter values given observed data, we are making a mistake.

In other words, the author has no mathematical examples to support the argument, and the objection is purely philosophical...

Not as I understand it. Note that the author's argument there isn't "frequentism is bad because it gives an unreasonable answer here", it's "the fact that frequentism gives a different answer here demonstrates that it really is answering a different question".

But the frequentist answer is only different when the frequentist can't use common sense. If you use min(D) as the frequentist estimator, you would get a very different confidence interval, as it would have the form [min(D) - constant, min(D)]. The CDF of the truncated exponential is F(x) = 1-exp(theta-x), and the CDF of the minimum of three samples is 1-(1-F(x))^3. I get that the frequentist 95% CI is [9.00142, 10], which for all intents and purposes is the same as the credible interval the author computes.

I agree that credible intervals and confidence intervals answer different questions. I don't think that it's obvious that the confidence interval approach is wrong, and the example in the blog post is definitely not evidence towards this.
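
The min(D)-based interval quoted above can be reproduced from the CDF of the minimum. A minimal sketch, assuming three samples and min(D) = 10 (the latter inferred from the upper endpoint of the quoted interval, not stated outright in the comment):

```python
import math

def min_based_ci(min_d, n, level=0.95):
    """CI of the form [min(D) - c, min(D)] for the truncated exponential."""
    # With F(x) = 1 - exp(theta - x), the minimum of n samples satisfies
    # P(min - theta > c) = exp(-n * c). Setting that to 1 - level gives c.
    c = -math.log(1 - level) / n
    return min_d - c, min_d

lower, upper = min_based_ci(min_d=10.0, n=3)  # min_d and n are assumptions
print(round(lower, 5), upper)
```

With these inputs the lower endpoint rounds to 9.00142, matching the interval given in the comment.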

A 95% confidence interval will contain the true mean 95% of the time (across an infinite number of replications of the experiment/study). For a single confidence interval, you have either captured the mean in your confidence interval, or you've not -- there's no probability about it.

I believe this is the correct frequentist interpretation. To quote wikipedia:

> A 95% confidence level does not mean that for a given realized interval there is a 95% probability that the population parameter lies within the interval (i.e., a 95% probability that the interval covers the population parameter).[10] According to the strict frequentist interpretation, once an interval is calculated, this interval either covers the parameter value or it does not; it is no longer a matter of probability.

This is where I get lost:

> For a single confidence interval, you have either captured the mean in your confidence interval, or you've not -- there's no probability about it.

Isn't there? The underlying truth is that you either definitely have or have not captured the population mean in any specific confidence interval. But you can't know this truth. In the long run, if "a 95% confidence interval contains the true mean 95% of the time across an infinite number of replications of the experiment/study," then isn't it true that any single specific experiment's CI has a 95% probability of containing the true value?!

In my untrained mind, this is exactly equivalent to flipping an unfair coin with a 95% chance of heads. Sure, before flipping, the outcome of heads has a 95% probability. After flipping, you either get heads or tails. But if you flip a coin and hide the outcome without looking at it, doesn't it still have a 95% chance of being heads as far as the experimenter can tell?

That cannot be the result of an experiment without some sense of the prior probability, and I don't think CIs as suggested in the article account for that.

For example, if I perform an experiment on a light source and find a 95% CI of 400-1200 lumens for the brightness, the actual probability of this being true is much higher if the light source is a 60W incandescent bulb than if the light source is the sun.

Except in real experimentation you control for all the variables except the one you're investigating. So it is not a very useful thing to compare a bulb with the Sun.

Instead, if you state that "luminescence OF THE 60W INCANDESCENT LIGHTBULB is 400-1200 lumens with 95% CI", then that's the useful information that lets you set the right expectations when designing lighting for your new house, for example.

I think you missed the point of my example. I was suggesting that an experiment performed on the sun showing a 95% CI of the brightness being 400-1200 lumens should result in a reasonable person believing that the probability of the Sun's brightness falling in that range is approximately zero, while the same result for a 60W light-bulb should result in a reasonable person being more than 95% certain that the bulb's brightness falls in that range.

Just like a large number of people misinterpret a P value of .01 to mean a 1% chance of the results being due to chance[1], CIs can be similarly misinterpreted.

1: A .01 P value actually means that if the null hypothesis is true, then you would get the result 1% of the time. The analogy to my above example would be that if I run an experiment and get a result that "the sun is less bright than a 60W light-bulb" with a P value of .01, it's almost certainly not true that the sun is less bright than the light-bulb, since the prior probability of the sun being less bright than a 60W light bulb is many orders of magnitude smaller than 1%.
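
The role of the prior in the sun-versus-bulb example can be made concrete with Bayes' rule. This is a rough sketch: the prior values and the probability of the result when the claim is true are hypothetical, and treating the 0.01 p-value as the probability of the result when the claim is false is a simplification.

```python
def posterior(prior, p_result_given_h, p_result_given_not_h):
    """Bayes' rule for P(H | result)."""
    numerator = prior * p_result_given_h
    return numerator / (numerator + (1 - prior) * p_result_given_not_h)

# H: "the sun is less bright than a 60W bulb". All numbers are hypothetical.
implausible = posterior(prior=1e-12, p_result_given_h=1.0, p_result_given_not_h=0.01)
# Same evidence, but for a claim that was a coin flip beforehand.
plausible = posterior(prior=0.5, p_result_given_h=1.0, p_result_given_not_h=0.01)

print(implausible, plausible)
```

The same result leaves the implausible claim almost certainly false while making the plausible one almost certainly true, which is the point of the footnote.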

To see how absurd that definition is, think about this - CI itself is random! So if you conduct a 100 experiments, you'll get a 100 (non-overlapping) CI's! So in which of those 100 CI's does the "true population mean" lie ? All 100 of them ? 95 of them ?! You tell me.

>So if you conduct a 100 experiments, you'll get a 100 (non-overlapping) CI's! So in which of those 100 CI's does the "true population mean" lie ? All 100 of them ? 95 of them ?! You tell me.

I don't know why you get the idea that all 100 will be non-overlapping. That's simply false.

And yes, if your assumptions were correct, regular (i.e. frequentist) statistics will state that roughly 95% of the CIs will contain the true mean. There is nothing absurd about it.

Speaking as a Stat TA, literally over 90% of the students taking the class will conduct 1 experiment, not 100, which gives you 1 CI, not 100, & then say that particular CI has a 95% chance of containing the population mean! Then when I tell them the mean is either in that CI or not (either 100% in, or 0% in), they google the CI definition & point me to that. That's why I said that definition doesn't work for the masses. It can be interpreted as "if you conduct 100 experiments & get 100 CI's then roughly 95 of those will contain the true mean", but then nobody ever conducts 100 experiments, so from their pov it's an absurd definition.

It's imperative to understand that these definitions are written not for the average user of statistics, but for a trained statistician. Unfortunately, the average stat consumer vastly outnumbers the professional. Papers are littered with statements like p value proves H0, or proves H1. I have had numerous conversations with scientists (not statisticians, but pharma/epidem/engg people who show up to the stat lab for consult) that their p value doesn't prove H0 or H1. "What do you mean you can't prove H1 ? Oh you mean it only rejects H0 ? Ok but isn't that same as prove H1 ? It isn't ?! Well in my field if I just state it rejects H1 it won't be well understood so I am going to instead say H1 has been proved!" So there's little the statistician can do.

Regarding overlap, I meant total/exact overlap, as in no two CIs will be identical for any continuous distribution.

>Speaking as a Stat TA, literally over 90% of the students taking the class will conduct 1 experiment, not 100, which gives you 1 CI, not 100, & then say that particular CI has a 95% chance of containing the population mean!

And as a TA, I hope you are marking them wrong! My stats professor went through great pains to point it out, as does my stats textbook.

>but then nobody ever conducts 100 experiments, so from their pov its an absurd definition.

The definition is not absurd. It's just a definition. You can argue that CI's are being abused and are not as helpful as people think they are and I'll probably agree with you. But that's no excuse for people to use it and get a pass for not even knowing what it means!

I contend that it was defined as such because at the time no one had anything better. I suspect that much is true even now. I'm not aware of any obviously better alternatives, and articles like these even suggest there aren't any - just that we need to be mindful that the CI alone doesn't allow for reliable conclusions.

At the end of the day, this is not a problem with the CI definition. It's not a statistics/technical problem. It's a social/cultural one. As such, the solution isn't to change statistics, but to fix the cultural problem: Why do we keep letting people get away with such analyses? Are there any journals that have a clear policy on these analyses? Are referees rejecting papers because of an over reliance on p-values?

Let's not change basic statistics definitions and concepts because the majority of non-statisticians don't understand them. When the majority of the public can't understand basic rules of logic (like A => B does not mean that (not A => not B)), we don't argue for a change to the discipline of logic. When huge numbers of people violate algebra (e.g. sqrt(a^2+b^2) = a + b), we don't blame mathematics for their sins. (I know I'm picking extreme cases for illustration, but the principle is the same). I had only one stats class in my curriculum. If you want people to perform correct statistics as part of the profession, make sure it is fundamental to much of their work. It would have been trivial to add a statistics component to most of my engineering classes, and that would hammer in the correct interpretation. Yet while we were required to know calculus and diff eq for most of our classes, none required any statistics beyond the notion of the mean (and very occasionally, some probability).

Statistics is a tool. It will always be the responsibility of the person invoking the tool to get it right.

>Regards overlap, I meant total/exact overlap, as in no two CIs will be identical on any conti dist.

Isn't that the whole point of inferential statistics? You have a population with a true mean. You cannot poll the whole population. Hence you take a sample. This is inherently random. There is variance in your estimate (obvious). What should be clear is that the CI should move with your point estimate. Furthermore, you never know the true stddev, so you estimate the stddev from your sample. Now both your center and the width of the CI will vary with each sample. I can't comprehend how you could hope to get the same interval from different samples, given that it is quite possible to get all your points below the mean in one sample and above the mean in the other.

I think people are bashing statistics because it isn't helping them come to a clear conclusion (which is fair). But as I said, all the proposals I've seen appear to be as problematic.
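
To illustrate how both the center and the width move from sample to sample once the stddev is estimated, here is a small sketch. The population parameters and sample size are made up, and the t critical value is hard-coded for samples of size 10 rather than computed.

```python
import random
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 9 degrees of freedom.
T_CRIT_9DF = 2.262

def t_interval(sample):
    """95% t-interval for the mean; assumes len(sample) == 10."""
    m = mean(sample)
    half = T_CRIT_9DF * stdev(sample) / len(sample) ** 0.5
    return m - half, m + half

rng = random.Random(42)
# Five hypothetical experiments drawn from the same population.
intervals = [t_interval([rng.gauss(100.0, 15.0) for _ in range(10)]) for _ in range(5)]
for lo, hi in intervals:
    print(f"[{lo:.1f}, {hi:.1f}]  width {hi - lo:.1f}")
```

Every run of the experiment yields a different interval, even though the population never changes.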

That's a ridiculous assumption because I have seen over and over again people not understanding what confidence intervals actually mean. They look intuitive but they really are not as simple as they seem.

Sure, but people are not necessarily scientists.

Well, no. Hardly any non-statistician knows what a CI really means. It's also just a statement about significance in disguise.

> "Well, that's why the article doesn't propose simply ditching P-values, it proposes reporting confidence intervals instead."

You could possibly enforce that when publishing in journals, but there is no way in hell "science journalism" would follow suit, so the general public would still be just as hopelessly misled, if not more so if we account for the general statistical incompetence of the typical science journalist.

If you eliminate p-value then you can't have authors that search for anything with p<0.05 and then publish, there will simply have to be some other justification. If p-value is gone it will have to be replaced with something and that something, the supposition is, will result in better science.

When writing a paper, you need to support your conclusion. Removing the p-value doesn't remove that need for support; authors will just find something different, hopefully better.

This could make it worse. Instead of hacking for 5%, maybe they’ll hack for the lowest possible.

To be clear, I mean rejecting p-values altogether (or at least requiring additional evidence), not the specific <0.05 requirement.

For some context: in the article, they do not call for abolishing p-values but for an end to the "significant" false dichotomy. An example would be to explain the consequences of all the values in the confidence interval, or even to simply reformulate a sentence from "no significant effect was found" to "our data neither prove nor disprove the presence of a significant effect".

Described pretty accurately by this XKCD


The proposal isn't about eliminating p.

"Lazy reader" is not the audience for scientific papers.

Protecting against misinterpretations by outsiders is not something that scientific research papers should worry about.

"Lazy reader" is precisely the audience for scientific papers.

That's the gist of the crisis science has been having of late: The popular meme for a long time was that it was just non-scientists who would misinterpret the research. Recently, it has become clear that misinterpreting papers is also rampant among trained scientists, to the extent that entire fields are being shown to be houses of cards.

> "Lazy reader" is precisely the audience for scientific papers


There are plenty of personalities and pundits that love to say "look at the data" or "the data is clear".

It bugs me when data is weaponized as truth to prove a conjecture, especially in the social sciences where studies are routinely difficult to replicate with consistent results.

It's not just misunderstanding papers, the entire social science field is in a replication crisis. A lot of people don't really want objective science and would rather push their agenda. Either that or they're simply incompetent.

If anything, getting rid of some term because we currently find it insufficient to convey what we mean by it will, in all likelihood, open the race for far more confusing, rosy and well-meaning, yet more meaningless nomenclature.

By that measure, I find this entire idea rather idiotic.

> getting rid of some term because we currently find it insufficient to convey what we mean by it will, in all likelyhood, open the race for far... more meaningless nomenclature

You have given no reason to believe this.

There's every reason to believe this. The article makes it very clear that the problem is the need by journals, funding agencies and researchers to have a boilerplate value so they can categorize the results as "true" or "false". You can change the value or the phrasing, but without solving the problem of "laziness" you won't fix anything.

Note that there are already journals that effectively do this (it is very hard to get a p-value put into the journal Epidemiology for example) and as far as I can tell, there are few if any negative repercussions evident.

Although my failure to see an elephant on the table does not rule out completely that there could be an elephant there, it does limit the possible size of the elephant to a few micrometers. Failure to reject the null hypothesis does in fact provide evidence against the other possibilities, so long as "other possibilities" are understood to mean "other possibilities with big effects."

I don't see why a scientist at a conference who's saying that two groups are the same has to be heard as claiming, "we have measured every electron in their bodies and found that they have the same mass, forget about six sigma, we did it to infinity." Instead they could simply be understood to be saying that the two groups must be similar enough to not have ruled out the null hypothesis in their study.

That's the thing. P values don't prove that anything must be. They simply say that if rerunning the experiment again, it would be surprising to get a different result. Conversely, if you don't find "statistical significance" it definitely doesn't mean there isn't a difference. In practice, it might (often) mean the study didn't have enough samples to find a relatively small effect, but the layperson making decisions (do I allow right turn on red or is that dangerous?) may not get that nuance. A book that really helped clarify my thinking on this is _Statistics Done Wrong_ by Alex Reinhart.

Edit: remove "interpret" from last sentence to clarify

Yes, this immediately reminded me of this book. To add an example (from the book): after implementing turn-on-red in a few places, a study was conducted to check if it increased the rate of accidents. The study found no statistically significant effect, so turn-on-red was rolled out statewide. Unfortunately, the study failed to detect an effect not because there was no effect, but because the sample was too small to confirm that the effect was statistically significant. Now that we have more data, it turns out that turn-on-red does increase the rate of accidents.

Read it in full here: https://www.statisticsdonewrong.com/power.html#the-wrong-tur...
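
A rough power calculation shows how a small study can miss a real effect. The accident rates and sample sizes below are hypothetical and are not taken from the book or the original study; the formula is the usual normal approximation for a two-sided test of two proportions.

```python
from statistics import NormalDist

def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    se = (p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group) ** 0.5
    shift = abs(p2 - p1) / se  # true difference in standard-error units
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# Hypothetical: accident rate rises from 1.0% to 1.2% after the change.
small = power_two_proportions(0.010, 0.012, n_per_group=1_000)
large = power_two_proportions(0.010, 0.012, n_per_group=200_000)
print(small, large)
```

Under these made-up numbers the small study would detect the increase well under one time in five, so "no significant effect" there says very little about whether the effect exists.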

> They simply say that if rerunning the experiment again, it would be surprising to get a different result.

Not really. A low p-value says that it was surprising to get the result that you got, assuming that the null hypothesis is true. And if the null hypothesis is true it would be surprising to get again the same result (i.e. a result as extreme). If the null hypothesis is not true, the result would not be so surprising (or maybe more, if the true effect is in the “wrong” direction).

The result we got gives some evidence for the null hypothesis being false, but if the null hypothesis was very very likely to be true before it may still be very likely to be true afterwards. In that case it wouldn’t be surprising to get a different result if the experiment is performed again.

Illustration: I roll a die three times. I get three ones. P<0.01 (for the null hypothesis of a fair die and the two-tailed test on the average). This is not simply saying that if I roll the die three times again it would be surprising to get something other than ones.

> I roll a die three times. I get three ones. P<0.01 (for the null hypothesis of a fair die and the two-tailed test on the average).

Hmm. At a glance, that doesn't seem right. Yes, the chances of rolling 3 1's is 1/(6^3), but if we only rolled once and got a single 1, we wouldn't have any reason to suspect that the die was unfair. So maybe we should only consider the second two repetitions, and conclude with p ~ .03 that the die is unfair? Otherwise, consider the case that we rolled a 1, 5, 2 --- certainly we shouldn't use this series of non-repeated outcomes as p < .01 evidence of an unfair die?

If the die is fair, the average score will be 3.5. One can define a test based on that value and reject the null hypothesis when the average score is too low or too high.

The sampling distribution for the average can be calculated and for three rolls the extreme values are 1 (three ones) and 6 (three sixes) which happen with probability 1/216 each. Getting three ones or three sixes is then a p=0.0093 result.
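
The sampling distribution is small enough to enumerate exactly. A sketch, assuming the test statistic is the distance of the average from 3.5 and that "as extreme" means at least that far away in either direction:

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)
FAIR_MEAN = Fraction(7, 2)

def two_tailed_p(observed_rolls):
    """Exact two-tailed p-value for the average under a fair-die null."""
    n = len(observed_rolls)
    observed_dev = abs(Fraction(sum(observed_rolls), n) - FAIR_MEAN)
    outcomes = list(product(FACES, repeat=n))
    # Count outcomes whose average is at least as far from 3.5 as observed.
    extreme = sum(
        1 for rolls in outcomes
        if abs(Fraction(sum(rolls), n) - FAIR_MEAN) >= observed_dev
    )
    return Fraction(extreme, len(outcomes))

p_ones = two_tailed_p((1, 1, 1))
p_mixed = two_tailed_p((1, 5, 2))
print(p_ones, p_mixed)
```

Three ones gives 2/216 = 1/108, about 0.0093, while a roll like 1, 5, 2 gives 14/27, about 0.52, so a series of non-repeated outcomes is nowhere near significant under this test.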

You raise a valid point. This is clearly not the best test for detecting unfair dice, because for a die which has only two equally probable values 3 and 4 we would reject the null hypothesis even less often than for a fair die! (In that case, the power would be below alpha, which is obviously pretty bad.)
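
The power problem can be checked the same way. This sketch assumes the rejection region is exactly "all ones or all sixes" for three rolls, as in the p=0.0093 calculation, and that the unfair die shows 3 or 4 with equal probability:

```python
from fractions import Fraction
from itertools import product

def rejection_probability(faces, n_rolls=3):
    """P(all ones or all sixes) for a die whose faces are equally likely."""
    outcomes = list(product(faces, repeat=n_rolls))
    # The assumed rejection region: the sum is the minimum or maximum possible.
    rejected = sum(1 for rolls in outcomes if sum(rolls) in (n_rolls, 6 * n_rolls))
    return Fraction(rejected, len(outcomes))

alpha = rejection_probability(range(1, 7))  # fair die: the test's size
power = rejection_probability((3, 4))       # die that only shows 3 or 4
print(alpha, power)
```

The unfair die can never produce three ones or three sixes, so its rejection probability is 0, below the 1/108 for a fair die.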

> They simply say that if rerunning the experiment again, it would be surprising to get a different result.

P values don't tell you the chances of getting the same result, only the chances of getting the same result by chance.

That's not what OP is saying though. Let's say p=.001 - That means we're confident the results are not due to chance.

If we repeat the experiment and get a different result, then we need to be looking into confounding variables and testing methodology. Just because the P-value is low doesn't mean there's no fundamental flaw in the experiment used to find the p-value.

P values don't talk about the result, they only talk about the probably of the null hypothesis being correct.

You're only allowed to say either nothing happened or it's unlikely nothing happened, but it doesn't say what the "unlikely nothing happened" means.

> P values don't talk about the result, they only talk about the probably [probability?] of the null hypothesis being correct.

I would say it’s the other way around.

The p-value says something about the result (how likely this result would be if the null hypothesis were true).

It doesn’t say anything about the probability of the null hypothesis being true.

> how likely this result would be if the null hypothesis were true

It's how likely the result is to occur by chance if the null hypothesis is true. A positive result can occur for lots of other reasons even if the null hypothesis is true, and the p-value doesn't tell you anything about how likely you are to get a certain result if the null hypothesis is true (or false).

> It's how likely the result is to occur by chance if the null hypothesis is true. A positive result can occur for lots of other reasons even if the null hypothesis is true, and the p-value doesn't tell you anything about how likely you are to get a certain result if the null hypothesis is true (or false).

I don’t think your comment makes sense.

Given a parametric model and a particular value of the parameter (i.e. the null hypothesis) one can calculate the sampling distribution of the data.

Therefore under the null hypothesis the model gives a well-defined probability distribution for the data and you can tell how likely you are to get a certain result.

There is no room for “other reasons”.

Let's imagine the simplest case where the scientist never actually ran the experiment in the first place and just made up their data.

Looking at the p-value of a study doesn't tell you anything about how likely the study was to have been based on fabricated data.

When we say “how likely you are to get a certain result if the null hypothesis is true”, one should understand “the null hypothesis is true” as “the data is generated by a process perfectly described by the model, including a particular value for the parameter”.

I agree that if the data is made up the results of the study and the statistical analysis based on the results will have no relation whatsoever with the fact that the null hypothesis was or wasn’t true.

The p-value tells you just how likely you are to get a certain (or more extreme) result if the data generating model is indeed correct and the null hypothesis is true.

We agree that the p-value doesn’t tell you anything about how likely it is that the study was based on fabricated data, or how likely it is that the model is correct or how likely it is that the null hypothesis is true.

The p-value doesn’t tell us anything about the real world. It’s a probability conditional on a hypothetical model.
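To make "a probability conditional on a hypothetical model" concrete, here is a minimal sketch. The coin-flip setup and the function name `binomial_p_value` are illustrative assumptions, not something from the thread. The null model is a fair coin, and the p-value is the probability of seeing at least as many heads as observed if that model were exactly true.

```python
from math import comb

def binomial_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """One-sided p-value: P(X >= heads) under a binomial null model.

    This is a statement about the data given the model, not about
    how likely the model (or the null hypothesis) is to be true.
    """
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )

# Hypothetical observation: 60 heads in 100 flips.
p = binomial_p_value(heads=60, flips=100)
print(f"P(at least 60 heads in 100 flips | fair coin) = {p:.4f}")
```

Nothing in the calculation looks at whether the coin was really flipped, whether the flips were independent, or whether the data were made up; all of that sits in the assumed model.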

> They simply say that if rerunning the experiment again, it would be surprising to get a different result.

That interpretation subtly suggests that the P-value gives an estimate of the likelihood of the alternative hypothesis being true.

It doesn't limit the size of the elephant.

Any observation is consistent with an infinite number of models. Eg., your sight is defective, ie., in many cases: your sample is biased, not big enough, etc.

And that A correlates with B, or fails to, to "some significance" is consistent with any causal relationship between A and B.

> Surveys of hundreds of articles have found that statistically non-significant results are interpreted as indicating ‘no difference’ or ‘no effect’ in around half (see ‘Wrong interpretations’ and Supplementary Information).

>Eg., your sight is defective, ie., in many cases: your sample is biased, not big enough, etc.

You can't expect a statistical test to comb through your experimental procedure to check whether or not you did it right. (Least of all, to comb through your personal medical history. ;) )

>Any observation is consistent with an infinite number of models.

I'm afraid that slips below the "get out of bed in the morning" pragmatism bar. Although it's a provocative idea, concluding that conclusions are impossible because you could be making a mistake or hallucinating would lead to, literally, not getting out of bed in the morning, if you took it seriously.

At the end of the day you just have to hope that the human beings reading your paper will be able to tell whether or not you did the experiment right. Any statistical test just walls off one way (or a few ways) in which you could fool yourself.

>Surveys of hundreds of articles have found that statistically non-significant results are interpreted as indicating ‘no difference’ or ‘no effect’ in around half (see ‘Wrong interpretations’ and Supplementary Information).

What we really want to avoid here is requiring every paper to have a twelve paragraph Sophist disclaimer about how knowledge is impossible. ;) Just because the text doesn't say "oh and by the way there could be gnomes in the LHC that are fooling us to entertain themselves," doesn't mean that the author wouldn't begrudgingly acknowledge it as an epistemic risk.

Well your "pragmatism bar" is higher than all future science.

It isn't a trivial sceptical point.

Does CO2 cause climate change, or climate change cause CO2, or are they both caused by a hidden factor (one of many) or do they co-vary accidentally (ie., no joint cause)?

No "statistical significance" of any kind can decide between them, yet this is how it is used. There is no way to do so with statistics; it requires repeated experimentation and refutation of all plausible causal models until the one that remains stands as "the most plausible".

"Statistics" is being used in place of the scientific method which it cannot replace. Covariation is not causation: the covarying of any number of variables is consistent with any number of causal models. No "test" decides.

The evidence for anthropogenic climate change involves a lot more than correlation, so maybe that isn't the best example. In any case, even if everyone on Earth is misinterpreting p-values, that's not a problem with p-values.

The point of statistics isn’t to attain absolute certainty or for the statistical test to make a decision.

The point is for people to make a decision using probability as one available decision aid, and then deciding whether to place a bet.

"concluding that conclusions are impossible because you could be making a mistake or hallucinating would lead to, literally, not getting out of bed in the morning, if you took it seriously."

Not if you don't have free will, which is apparently the conclusion that a lot of scientists seem to be coming to recently.

Haha, then allow me to rephrase. A brain whose molecules react that conclusions are impossible will no longer chemically get out of bed in the morning.

Why can't you act in ways that are at odds with your conclusions?

When I look around me I see people doing this every day.

This is why we report statistical power. It does in fact put probabilistic limits on the size of the elephant. In effect, we can make a statement like "at least 90% of studies like ours will detect an elephant larger than 20 microns if it is actually there on the table". It's not reasonable to interpret the results of a single study without consideration of power.
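A rough sketch of what such a power statement amounts to, under assumptions that are mine rather than the commenter's: a two-sided z-test on a sample mean, normally distributed data, and a known standard deviation. The function name `power_two_sided_z` and the numbers are hypothetical.

```python
from statistics import NormalDist

def power_two_sided_z(effect: float, sd: float, n: int, alpha: float = 0.05) -> float:
    """Probability that a two-sided z-test of 'mean = 0' rejects
    when the true mean is `effect` (normal data, known `sd`)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect * n**0.5 / sd  # true effect measured in standard errors
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

# Power grows with sample size for a fixed (hypothetical) effect of 0.25 sd.
for n in (10, 50, 200):
    print(n, round(power_two_sided_z(effect=0.25, sd=1.0, n=n), 3))
```

As with the p-value, the number is conditional on the model: it says how often studies like this one would detect an effect of the stated size if the model holds, not whether the model is right.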

> It does in fact put *probabilistic* limits on the size of the elephant.

Emphasis added (obviously). This word was missing from the analysis of whatshisface (GGP). While all of our models are probabilistic, and it would be silly to constantly describe everything this way in normal parlance, a discussion of how we want to go about building and interpreting models of the world is exactly the domain where this needs to be explicitly stated. You're not wrong here, but neither is GP. GGP is wrong, though not provably so (again, obviously).

> In effect, we can make a statement like "at least 90% of studies like ours will detect an elephant larger than 20 microns if it is actually there on the table".

The "in effect" disclaimer makes this statement arguable, but fundamentally we still can't say that. We can only say whether or not an elephant was detected, since detection (in this sense) is subjective. We can't say whether elephants actually exist or not. Perhaps all tables have invisible elephants on them, or perhaps all elephants are hallucinations shared by multiple humans through crazy coincidence. If 50% of elephants that actually exist on tables are invisible, and 10% of visible elephants are not detected due to error, then only 45% of studies will detect an elephant even though it will still seem like 90% to the best of our knowledge. Since we don't know what percentage of elephants are currently detectable we need to be even more ridiculous and say something like "we guesstimate that there is a 99ish% chance that at least 90% of studies like ours will detect an elephant larger than 20 microns if it is actually there on the table".

Edit: s/whathisface/whatshisface

Yes, okay. There are implicit assumptions in scientific studies (e.g., there are no invisible elephants, or the study we are doing is actually related to the question we are investigating!). Power calculations refer to the model and not to whether it is the correct model. To some extent we routinely worry about certain types of "hidden from observation" problems: we have zero-inflated Poisson models, or we worry about what happens if there is a limited susceptible subpopulation (that could deplete over time differentially depending on treatment, etc.). But it is not correct to suggest that if a study of whatever power does not find an effect, then a huge effect and no effect are equally plausible.

I have seen outrageous examples of "accepting the null hypothesis" many times, but many negative result studies have great value and even a single negative study can provide evidence against a very large effect.

> There are implicit assumptions in scientific studies ... Power calculations refer to the model and not to whether it is the correct model.

True, but this crucial assumption needs to be kept in mind when using the study to inform yourself of the state of the world. Too often, it's forgotten that there are two questions that need answering for the study to be relevant, and only one of those has a number associated with it.

In that sense, Bayesian statistics, where this is explicit, are less misleading because they actually draw attention to the fact that we don't know that the model is correct.


Interesting point about Bayesian methods: whenever there are potential flaws or additional uncertainty, it's better if they are made explicit to prompt thoughtful interpretation.

It's still really convoluted and removed from the actual elephant. From a frequentist perspective, the elephant either is or is not on the table, and if it is then it has a certain size - not random quantities subject to probabilistic statements.

If you're willing to treat probability as a measure of uncertainty, then you'd be right at home as a Bayesian.

This thread is a prime example of the danger of p-values, which measure the likelihood of the data if you assume the null hypothesis. This is very different than the probability that the hypothesis is true.

Statistical significance doesn't communicate expected bounds on possible effect size though. If the two sigma bounds are hundreds of meters, the failure to see an elephant in the table is completely meaningless. On the other hand, if it's a few micrometers, that tells you a lot.

Methods like gauge R&R can be used to help decide if the measurement system being used in an experiment is sensitive enough to detect an elephant of a given size.


Yes, failure to rule out an elephant means you probably didn't look hard enough (collect enough data).

But successfully ruling out an elephant is uninteresting if you didn't expect an elephant. The problem is that "statistically significant" sounds impressive, but we shouldn't be impressed.

I guess we need a less impressive term for this? Maybe something like "may have avoided statistical blindness."

I've been suggesting "null-improbable". I'd be happier still if people stopped dichotomizing on this concept, though.

>They could simply be understood to be saying that the two groups must be similar enough to not have ruled out the null hypothesis in their study

The groups need not be similar to fail to rule out the null. It can also be that the measurements are too noisy and too few.

Also on the flip side, if you do reject the null, it doesn't mean that the groups are different. It could also be that you have so many measurements that you are picking up tiny biases in your experiment or instruments.

Null hypothesis testing is almost always too weak of a test to be useful.

To add to this, the main problem with null hypothesis testing IMO is that mathematically you are comparing an infinitesimally small hypothesis: effect exactly equals 0.000000000000... to an infinitely large one: effect is somewhere between 0.00000...00001 and infinity.

A tiny effect is very easy to reject (p<0.0000...000001). I can tell you before running the experiment that any two objects close enough for long enough to be in each other's light cone have at least a tiny effect on each other.

The null hypothesis is hereby rejected for most anything relevant to humans. No need to calculate p-values. Just reference this comment.
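A small sketch of why a tiny effect becomes "significant" with enough data. The setup is an assumption for illustration: two equal groups, known and equal standard deviation, and an observed difference that happens to equal a true effect of 0.001 standard deviations. The function name `two_sided_p` is made up.

```python
from math import erfc, sqrt

def two_sided_p(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided z-test p-value for a difference in means between
    two equal-sized groups with known, equal standard deviation."""
    se = sd * sqrt(2 / n_per_group)   # standard error of the difference
    z = mean_diff / se
    return erfc(abs(z) / sqrt(2))     # equals 2 * (1 - Phi(|z|))

# Same practically negligible difference, ever larger samples.
for n in (100, 10_000, 1_000_000, 100_000_000):
    print(f"n per group = {n:>11,}  p = {two_sided_p(0.001, 1.0, n):.3g}")
```

The effect size never changes in this loop; only the standard error shrinks, which is why the p-value alone says nothing about whether the difference matters.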

Sorta, but people have taken that reasoning and (by this analogy) begun looking for elephants with their eyes closed. In that case, the failure to see an elephant doesn't actually tell you much of anything.

I predict nothing will change. Flaws in p-values and confidence intervals have been apparent since almost their inception. Jaynes spoke out against it strongly from the 60's on (see, for example, his 1976 paper "Confidence Intervals vs Bayesian Intervals"). Although I can't find it right now, there was a similar statement about p-values from a medical research association in the late 90's. It's not just a problem of misunderstanding the exact meaning of p-values either. There are deep rooted problems like optional stopping which render it further useless.
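For anyone unfamiliar with optional stopping, a minimal simulation sketch follows. All of its details are assumptions chosen for illustration: the null is true (mean 0, sd 1), the analyst checks a z-test after every 10 observations up to 200, and stops at the first |z| > 1.96.

```python
import random
from math import sqrt

def rejects_with_peeking(rng, max_n=200, peek_every=10, z_crit=1.96):
    """One null experiment, tested repeatedly as data arrive.
    Returns True if any interim look crosses the threshold."""
    total = 0.0
    for i in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if i % peek_every == 0 and abs(total / sqrt(i)) > z_crit:
            return True
    return False

def rejects_fixed_n(rng, n=200, z_crit=1.96):
    """One null experiment, tested once at a pre-planned sample size."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return abs(total / sqrt(n)) > z_crit

rng = random.Random(0)
trials = 2000
peeking_rate = sum(rejects_with_peeking(rng) for _ in range(trials)) / trials
fixed_rate = sum(rejects_fixed_n(rng) for _ in range(trials)) / trials
print(f"false positive rate with peeking: {peeking_rate:.3f}")
print(f"false positive rate with fixed n: {fixed_rate:.3f}")
```

Both analysts use the same nominal 0.05 threshold, but the one who peeks gets many chances to cross it, so the nominal level no longer describes the procedure actually followed.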

The problem is that with all its problems, statistical significance provides one major advantage over more meaningful methods: it provides pre-canned tests and a number (.05, .01, etc) that you need to 'beat'. The pre-canned-ness/standardization provides benchmarks for publication.

I once worked in a computational genomics lab. We got a paper into PNAS by running Fisher's exact test on a huge (N=100000+) dataset, ranking the p-values, taking the lowest ones, and reporting those as findings. There's so much wrong with that procedure it's not even funny.
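One of the things wrong with "rank the p-values and report the smallest" can be sketched with a simulation. It assumes, for illustration only, that every one of 100,000 null hypotheses is true and that p-values are uniform on [0, 1] under the null (real Fisher exact p-values are discrete, so this is an idealization).

```python
import random

rng = random.Random(1)
n_tests = 100_000

# Assumption: all nulls are true, so each p-value is a uniform draw.
p_values = sorted(rng.random() for _ in range(n_tests))

naive_hits = sum(p < 0.05 for p in p_values)
bonferroni_hits = sum(p < 0.05 / n_tests for p in p_values)

print("smallest p-values:", [f"{p:.1e}" for p in p_values[:5]])
print("tests with p < 0.05:", naive_hits)
print("tests with p < 0.05 / n_tests (Bonferroni):", bonferroni_hits)
```

Even with no real effect anywhere, the top of the ranked list looks impressive. Bonferroni is used here only as the simplest correction to show the contrast; it is not a claim about what that paper should have used.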

Hippocratic medicine lasted well into the 19th century, centuries after the scientific revolution. There'd been critics correctly calling it an intellectual fraud before then. You could've taken this as proof that no force on Earth could drag medicine into modernity, but it did sort of happen, as it became public, common knowledge that doctors were harming more people than they helped. They did start cleaning up their act (literally) though it took a long time and I think they're still collectively irrational about chronic conditions.

I hope we aren't worse at reform than they were in the 1800s.

Working in the field, it is getting better. It's slow, but getting better.

As I recall, instead of "compatibility intervals" (or confidence intervals), other gainsayers of P tests have proposed simply making the existing P criterion more selective, like a threshold value of .01 rather than .05, which equates to increasing the sample size from a minimum of about 10 per cohort to 20 or more.

I suspect this will be the eventual revision that's adopted in most domains, since some sort of binary test will still be demanded by researchers. Nobody wants to get mired in a long debate about possible confounding variables and statistical power in every paper they publish. As scientists they want to focus on the design of the experiment and results, not the methodological subtleties of experimental assessment.

Raising threshold will not just reduce probability of false positive result, but also will raise probability of false negative. Social sciences deal with complex phenomena, and it may be that no simple hypothesis like A -> B describes reality with p<0.05. In reality A may cause B, but there are also C, D, ..., Z: some of them also cause B, others work the other way and cancel some of the rest. And some of them work only when the Moon is in the right phase.

p<0.01 is good when we have a good model of reality which generally works. When we have no good model, there is no good value for p. The trouble is all the hypotheses are lies. They are false. We need more data to find good hypotheses. And we think like "there is useful data and there is useless data; we need to collect the useful and reject the useless". But we do not know what data is useful while we have no theory.

There is an example from physics I like: static electricity. Researchers described in their works what causes static electricity. There was a lot of empirical data. But all that data was useless, because the most important part of it didn't get recorded. The most important part was the temporal behavior of the phenomenon. Static electricity persisted for some time after charging and then discharged. Why? Because no material is a perfect insulator, there was a process of electrical discharge, with voltage and current. It was a link to all other known electrical phenomena. But physicists missed it because they had no theory; they didn't know what was important and what was not. They chased what was shiny, like sparks from static electricity, not the lack of sparks after some time.

We are modern people. We are clever. We are using statistics to decide what is important and what is not. Maybe it is a key, but we need to remember that it is not a perfect key.

> Raising threshold will not just reduce probability of false positive result, but also will raise probability of false negative.

And this article emphasises that huge numbers of scientists and their audiences cannot correctly interpret negatives (false or otherwise).

The negative means: "Our data does not appear to show any correlation that met the criteria we chose to determine statistical significance"

This negative is frequently misinterpreted as: "We have shown with acceptable confidence that there is no correlation" (which does not inherently follow).

Therefore, adjusting the common p value threshold to P<0.01 while not correcting this widely-held cognitive error could (potentially) even worsen the problem, because people will encounter proportionally even more of the 'negative' results that they are already well-established as poor interpreters of.

> The trouble is all the hypotheses are lies

The problem with p values is the opposite: in a complex system, null hypothesis is almost never true. Everything affects everything else. It's the magnitude of the effect that is important, because it can be too small to have practical implications.

If you do enough studies eventually you'll find a result with p value below 0.05 that will overestimate the magnitude of the effect by a lot, and publish it.
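The overestimation described above (sometimes called the winner's curse) can be sketched with a simulation. The numbers are hypothetical: a true effect of 0.1 sd, 50 observations per study, a z-test with known sd, and only "significant" studies getting published.

```python
import random
from math import sqrt

rng = random.Random(2)
true_effect, sd, n = 0.1, 1.0, 50   # hypothetical small effect, small studies
se = sd / sqrt(n)
z_crit = 1.96                        # two-sided test at the 0.05 level

published = []
n_studies = 5000
for _ in range(n_studies):
    estimate = sum(rng.gauss(true_effect, sd) for _ in range(n)) / n
    if abs(estimate) / se > z_crit:  # only significant results get written up
        published.append(estimate)

mean_published = sum(abs(e) for e in published) / len(published)
print(f"true effect:                     {true_effect}")
print(f"share of studies published:      {len(published) / n_studies:.2f}")
print(f"mean published effect magnitude: {mean_published:.2f}")
```

In this setup an estimate can only clear the threshold if it is larger than about 1.96 standard errors, which is already well above the true effect, so every published estimate exaggerates by construction.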

In social sciences if the effects aren't clear and readily apparent without quibbling over whether p<0.05 or p<0.01 is the right standard then perhaps the whole thing is a waste of time. If our experimental techniques are insufficient for dealing with multiple factors and complex webs of causality then why bother?

How clear or important the effect is has nothing at all to do with a p-value. I can have p < 10^(-10) and still have an effect that is so weak that it's meaningless. The confusion you're having is so pervasive and a big part of the problem. You have to use effect sizes in order to measure this.

No I'm not confused by the distinction between p-values and effect sizes, which is why I intentionally used the phrase "clear and readily apparent".

Science is a coordinated effort of many people. Scientists need the ability to communicate any partial knowledge they find.

I think you're using a strange definition of p<x here. p<0.01 does not mean that the phenomenon is perfectly understood, merely the likelihood that some independent variable affects some dependent variable in some direction.

For example, global weather is a hugely complex system which is far from fully understood. Still, take an incredibly simple predictive model of temperature: say, predict the current temperature by just predicting the average temperature for the current season. That's not going to be very accurate, but if you run it for a year it will almost certainly be p<.01.

> Raising threshold will not just reduce probability of false positive result, but also will raise probability of false negative.

Not really. You raise the probability of an inconclusive result when you could otherwise have gotten a positive. If you interpret p > threshold as “null hypothesis is true”, then you are doing the statistics wrong.

In most cases, I think a better model would be to extract an effect size such that an effect larger than that size is ruled out by the study to some degree of confidence. Currently, I read about studies that conclude that “such-and-such had no significant effect detected by this study.” Concretely, this looks like “vaccines had no significant effect on autism risk.” This may be accurate, but it’s lousy. How about “vaccines caused no more than a 0.01% increase in autism, and a bigger study could have set an even tighter limit.”

Physicists regularly do this. For example, we know that the universe has no overall curvature to a very good approximation.

We already see this happening in the clinical literature though. Effect estimates that show a positive clinical outcome, but have confidence intervals that barely brush the null are described as having "no impact".

The example from the article about the two drug studies seems to indicate that would not be useful.

> For example, consider a series of analyses of unintended effects of anti-inflammatory drugs. Because their results were statistically non-significant, one set of researchers concluded that exposure to the drugs was “not associated” with new-onset atrial fibrillation (the most common disturbance to heart rhythm) and that the results stood in contrast to those from an earlier study with a statistically significant outcome.

> Now, let’s look at the actual data. The researchers describing their statistically non-significant results found a risk ratio of 1.2 (that is, a 20% greater risk in exposed patients relative to unexposed ones). They also found a 95% confidence interval that spanned everything from a trifling risk decrease of 3% to a considerable risk increase of 48% (P = 0.091; our calculation). The researchers from the earlier, statistically significant, study found the exact same risk ratio of 1.2. That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).
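The quoted P values can be roughly recovered from the intervals alone. This sketch assumes the intervals are Wald-type, i.e. symmetric on the log scale with a 1.96 multiplier, which the article does not state; the function name `p_from_ratio_ci` is made up.

```python
from math import erfc, log, sqrt

def p_from_ratio_ci(ratio: float, lower: float, upper: float, z_crit: float = 1.96) -> float:
    """Approximate two-sided p-value for 'ratio = 1', backed out of a 95% interval.

    Assumes the interval is symmetric around log(ratio) on the log scale.
    """
    se_log = (log(upper) - log(lower)) / (2 * z_crit)  # standard error of log(ratio)
    z = log(ratio) / se_log
    return erfc(abs(z) / sqrt(2))                      # equals 2 * (1 - Phi(|z|))

# Risk ratio 1.2 in both studies; only the interval width differs.
print(f"later study:   p = {p_from_ratio_ci(1.2, 0.97, 1.48):.3f}")
print(f"earlier study: p = {p_from_ratio_ci(1.2, 1.09, 1.33):.4f}")
```

Under that assumption the results should land close to the article's 0.091 and 0.0003. The point estimate is identical in both calls, so the "conflict" between the two studies comes entirely from precision.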

> Nobody wants to get mired in a long debate about possible confounding variables and statistical power in every paper they publish

I know, right? God forbid that we take a close look at the ways we might be fooling ourselves. Sounds like hard work.

The OP spends some time on a point that the threshold is fairly arbitrary, and the problem is misinterpreting what it actually _means_ for validity and other conclusions.

I suspect just changing the threshold (especially as a new universal threshold, rather than related to the nature of the experiment) wouldn't even strike the authors as an improvement.

> Third, like the 0.05 threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention. It is based on the false idea that there is a 95% chance that the computed interval itself contains the true value, coupled with the vague feeling that this is a basis for a confident decision.

Part of the problem is that p-values are not the best indicator in all applications. One question in my work is whether a process change affects the yield. A confidence interval of (-1%,+1%) is much different than (-20%,+20%), even though they would look the same if I was just interested in the p-value. We might also accept changes with a (-1%,+10%) confidence interval. We can't 'prove' that yields would increase, but there is significantly more upside than downside.
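A quick sketch of the yield example, assuming a normal approximation and treating the quoted intervals as 95% intervals in percentage points of yield. The point estimates and the function name `interval_and_p` are hypothetical, picked to reproduce the three intervals mentioned.

```python
from math import erfc, sqrt

def interval_and_p(estimate: float, se: float, z_crit: float = 1.96):
    """Return (lower, upper, two-sided p) under a normal approximation."""
    lower = estimate - z_crit * se
    upper = estimate + z_crit * se
    p = erfc(abs(estimate) / se / sqrt(2))
    return lower, upper, p

# (point estimate, standard error), in percentage points of yield
scenarios = {
    "tight around zero": (0.0, 1.0 / 1.96),   # interval (-1, +1)
    "wide around zero":  (0.0, 20.0 / 1.96),  # interval (-20, +20)
    "mostly upside":     (4.5, 5.5 / 1.96),   # interval (-1, +10)
}
for name, (est, se) in scenarios.items():
    lo, hi, p = interval_and_p(est, se)
    print(f"{name:>18}: ({lo:+.1f}%, {hi:+.1f}%)  p = {p:.3f}")
```

All three fail a 0.05 threshold, yet they support very different decisions: the first rules out any large change, the second rules out almost nothing, and the third has far more upside than downside.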

Exactly, sample sizes will have to increase with the selective threshold.

I'm surprised the authors don't talk about 'practical significance' vs 'statistical significance'. Statistical significance can be easily gamed, especially if you're relying on one study. I think the real problem is the reliance on one study to make broad generalizations. The 'replication crisis' is everywhere.

I hope that you're wrong because it's solving the wrong problem.

A compatibility interval (at an agreed upon arbitrary level) communicates the magnitude of the difference as well as the uncertainty, which makes it much better for comparing options.

If my medication alleviates symptoms for 94-95% of patients above the current gold standard of 92-93% you could say it's "statistically significantly better", but the marginal improvement may not be worth the investment. Conversely if my medication alleviates symptoms for 50-80% of patients and the gold standard is 45-55% it would at least warrant future research (and if my medication has fewer severe adverse events it might be a better bet overall).

But this is just one small part of the whole picture: ideally we'd have preregistered experiments, experimental data published (or available to researchers where not possible due to confidentiality) and incentives for replication. Maybe this is too much for every field of science, but for ones where a wrong decision could have a severely detrimental impact they would create much more value than moving the P value.

Yes, there's a difference between a "statistically significant difference" and a "practically significant difference". One does not imply the other.

So if I am reading you right, there is the effect size and the "dollar tag" (or other decision criterion) associated to it.

'Nobody wants to get mired in a long debate about possible confounding variables and statistical power in every paper they publish.'

This is literally every paper I publish.

Also, tightening the p-value criterion has problems of its own - as mentioned, it boosts the false negative rate, which is not a consequence free act. In the work I do, it's also a largely arbitrary threshold I can meet if I give the computer enough time.

> some sort of binary test will still be demanded by researchers.

Indeed. P-hacking is so much less fun when you don't have a P to hack.

The fundamental problem is that 0 is a privileged value of effect size. So you can replace a p-value with confidence intervals, or credibility intervals (which are the same as confidence intervals as N increases to infinity, and the data dominate the posterior), or whatever, but it will always be appropriate (in the relevant scenario) to ask "is there an effect size at all?"

This is why these calls to eliminate significance testing always seem really naive and short-sighted to me. P-values are abused, and people confuse p-values and effect size, but there will always be a need to focus on 0 as that supremely-important number. ε can be judged on its practical significance but 0 is always less.

Anyway, I agree with you but wanted to point out that there's two sides to the coin, and both lead in the same direction.

I agree significance is mis-used, but in the opposite way than these authors. They are concerned that authors claim "non-significant" means "no effect," I see a lot of authors claiming "significant" means "causal effect." They don't account for the consequences of running multiple tests, and of endogeneity.

Differences between means of any two groups (e.g. treatment and control) on any outcomes will tend to be non-zero. Interpreting this sample difference as a population difference without considering a confidence interval seems risky.

It turns out science requires really careful thinking ... who would have thought?

I think that funding and publishing pressure turns the already hard problem of doing good science into a Pareto optimisation between doing science, publishing, and getting funding. The result is partially coerced results, stronger-than-justified conclusions, lenient interpretations, and funding of course.

It’ll go this way whatever the metric. I see the same sort of crap in ML papers where the authors report far more and lie far more at the same time.

I gave a relevant talk a few years ago: How to Kill Your Grandmother with Statistics[1].

The authors are spot on that the problem is not p-values per se but dichotomous thinking. People want a magic truth box that doesn’t exist. Unfortunately there are a ton of people in the world who continue to make money off of pretending that it does.


I like your statement "there is no such thing as a truth machine". But your talk, like this article, seems to suggest that the world would be better off if we just dropped significance testing, AKA dropping the objectivity of science. My guess would be that in that world there would be many more talks about how we need to introduce some kind of objective test to prevent all these drugs from getting to market that were subjectively accepted.

This resonates with common social problems we're all familiar with: People like stereotyping because it's a way to categorize humans without having to think too hard about their differences.

To me the issue stems from an even deeper fear of uncertainty. It takes a rock-solid psyche to be comfortable with the idea that we know little and a lot of our science would have a tough time standing up to statistical analyses.

What happens when your model errors aren't normally distributed?

If the kurtosis is high, p-values are over-stated. If fat-tailed then p-values are understated.

Why? Because the likelihood of your p-value isn't guaranteed to be normally distributed.

Normal is a nice assumption but asymptotic can take a long time to kick in. The CLT is beautiful analytically, but fortunes are made from people who assume it.

> What happens when your model errors aren't normally distributed?

Honestly...? You're screwed. At least in Bio, where most researchers haven't taken calculus, most folks will screw up the t-test or their ANOVA if you are not super careful. For non Gaussian data you better pray it's Poisson or has some other exotic name that you can at least google.

Especially with low N, you just kinda pray it's normal and then you go and try and get grant funding with those results.

Cynically, in the end, it barely matters. It's all about that grant money. Whatever way you can tease that data to get more grants, you just do that. No-one ever checks anyway (Statcheck excepted)

Lack of statistical literacy is a huge problem these days. As the modern workforce trends towards more analytical methods, statistics can be used as a weapon to fool and bend the truth.

I'm frankly tired seeing executives going on stage trying to show some numbers and graphs to prove a point on some variables. You see this in board meetings too. The sample sizes are too small to conclude anything significant about it!

> When was the last time you heard a seminar speaker claim there was ‘no difference’ between two groups because the difference was ‘statistically non-significant’?

I'm an Economics PhD (and former professor) and if someone were to say those lines at an academic conference there is a high likelihood that they would be literally laughed at.

Maybe it is because of my background in a quantitative field where we place a huge emphasis on statistical rigor, but t-tests were pretty much dismissed by anyone serious 20+ years ago. Seems like the issue stems from those disciplines without a stats/math background just pointing to t-stats. My wife reads medical literature for her work and I gag pretty close to every time she asks me to look at the results.

Yeah - I'm an epidemiologist, and the medical literature is painful. "Statistical curmudgeon" is, as far as I can tell, my primary role as peer reviewer.

The fundamental problem here seems to be, that you cannot get around the need for a statistician (or someone from another field who has a similarly deep understanding of statistics) to look at the data. There is no shortcut for this, but we simply do not, as a society, have enough people with statistical knowledge sufficient to the task.

There is not, I suspect, any other solution but that we must train a whole lot more statisticians. This means we will need to give more credit, and authority, and probably pay, to people who choose to pursue this field of study.

I'm afraid that this problem will only be exacerbated by the influx of "data scientists" that may know how to implement linear regression and a few machine learning models, but lack the statistical expertise to design experiments, vet assumptions, and verify results.

This comment is not meant to disparage anyone who considers themselves a data scientist. However, as someone who has advanced degrees in both statistics and computer science, employers and recruiters outside of R&D roles have shown very little interest in my statistical background aside from my machine learning experience. My experimental design, data handling (not just data cleaning, but data collection) skills, and theoretical understanding are rarely discussed. Statisticians are compensated much less than programmers--maybe deservedly so--but to the extent that I'm compensated the same as a "data scientist" who only studied computer science and didn't study any statistics, I feel like many employers, even those who benefit heavily from them, don't properly compensate statisticians.

At my alma mater, the poor wages have really hurt the statistics program as more students have decided to enroll in the "data science" (typically housed within business or computer science departments) programs. I think this is a really unfortunate trend because while those programs teach you how to implement gradient descent, do basic data wrangling in Python and R, make data visualizations, etc. they don't teach experimental design or the statistical theory that drives applied statistics. Perhaps as the supply of well-trained statisticians decreases and demand increases there will be upward wage pressure, but I think it's more likely that unqualified and inexperienced "data scientists" will continue to be shoehorned into these empty roles instead.

I'm curious, what are the best examples of where your statistics knowledge was useful?

What do you think practicing data scientists should learn to be more effective? Experimental design? Something else?

I could talk about a lot of specific circumstances, but in the interest of brevity I'll say this: The reason I believe a well-rounded statistics education is so valuable is because it ultimately teaches you to think critically about questions and answers. Good experimental design is driven by a solid understanding of theory--what statistical assumptions must be satisfied, which ones can be reasonably assumed or violated, and whether they're realistic to achieve.

I don't believe you need a PhD to call yourself a scientist, but I do think one common trait most scientists share is curiosity. To that end, I would encourage practicing data scientists who might not have a formal statistics education to not shy away from statistical theory. A solid theoretical understanding is what guides you when the questions and answers aren't clear--and I think the fundamental shortcoming of many data science programs is that they prepare their students for extremely simplified (and therefore unrealistic) questions with easily obtainable answers relative to what will be encountered in the real world.

I apologize for giving you an answer that isn't as coherent as I would like it to be. I tried not to be too verbose, but I think I failed at that anyway. I have a lot of feelings on this topic that I haven't fully articulated to myself yet. I could answer your first question if you're still interested, but as an addendum to my original comment: even though my statistics degree has gained me nothing in terms of career advancement or an increase in salary or opportunities, it's been truly invaluable to me as a programmer, a public speaker, and an advocate, because of the critical thinking skills it taught me. I hope others continue to recognize the value of statistics in academia and stay curious about it.

I agree with the fundamental problem you describe.

But I don't think it's just a statistician shortage. It's that most researchers are rushing to get as many papers as possible published (which, yeah, is arguably an 'economic' _incentive_ to do statistics sloppily; but mostly I mean they don't feel they have _time_ to do it right), and most universities are trying to cut internally funded research budgets (meaning no money to pay all these extra statisticians, even if they were trained).

It's a "market" problem with how the work of science is actually materially rewarded and sustained. Papers, papers, papers. (Which for that matter -- a good properly trained statistician ought to have the same academic status as the researchers, but you don't get tenure by helping someone else analyze their research...)

It also doesn't help if papers are submitted to journals and reviewed by peers who also have an inadequate understanding of statistics.

Change needs to start at the top, otherwise the reviewers will not understand what they are reviewing, as silly as that sounds.

I think that the authors of most papers in the biological sciences will at least consult with a statistician. My experience has been that many of these issues are simply hard to reason through and our conventional common statistical language doesn't work very well.

Couple this with the scientist going to the statistician basically saying "what can I claim with some kind of plausibility", rather than being indifferent to whether the result is interesting or groundbreaking or would make a good clickbait headline, and it's hard to figure out what's true.

In my experience it is even worse than that. I interviewed several professional statisticians with regard to consultancy on a particular experiment. They all worried about different things (sources of error or misinterpretation), according to what in their career up to that point had been important.

It was clear to me that to do the job properly I would have to worry about everything. And I couldn't trust any of the statisticians to do that well for me, as none of them were aware of all the things.

I suspect this is only done well in extremely narrow domains - maybe nuclear physics (e.g. CERN)? Where everyone present is extremely well educated about the statistics - not as a separate discipline, but as necessary background understanding to do non-statistical jobs too.

A game I love to play with people who think I should just calculate the correlation between two series is to tell them to plot cos(x) and sin(x) with sufficient samples to have a smooth chart with many periods. The two curves look kind of identical. Then calculate the correlation between these two series...
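For anyone who wants to play along, here is a minimal sketch in plain Python. The number of periods and the sampling density are arbitrary choices for illustration, and `pearson` is a hand-rolled helper rather than anything from a library.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# 20 full periods, 1000 samples per period (illustrative assumptions).
periods, per_period = 20, 1000
n = periods * per_period
xs = [2 * math.pi * periods * i / n for i in range(n)]

sin_x = [math.sin(x) for x in xs]
cos_x = [math.cos(x) for x in xs]
# The same sine curve shifted by a quarter period, which lines it up with cos.
sin_shifted = [math.sin(x + math.pi / 2) for x in xs]

r_raw = pearson(sin_x, cos_x)
r_shifted = pearson(sin_shifted, cos_x)
print(f"corr(sin, cos)            = {r_raw:.6f}")
print(f"corr(sin shifted, cos)    = {r_shifted:.6f}")
```

The two series are the same curve up to a phase shift, yet over whole periods the plain correlation comes out essentially zero, and it only jumps to essentially one once the shift is undone.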

Do you need just a statistician or a statistician and experiment designer?

Yet nature forces you to provide p values and n count for everything you can in any figure as if that's enough to guarantee significance of results.

We need to start publishing with transparent and reproducible code from raw data to figure. Show me the data and let me make my own conclusions.

It's not too hard. I'm writing my PhD thesis and every figure is produced from scratch and placed in the final document by a compilation script. My Jupyter notebooks are then compiled to PDF and attached to the thesis document as well. Isn't this a better way of doing the "methods" section?

I read that as mother nature forces p-values and was, huh? You mean the journal, of course.

This doesn't work for data gathered on humans, which has to be kept private.

Unless you have implemented some new method, I don't see why the code would be of any interest.

Because instead of vaguely describing what I did I can show you exactly what I did.

Instead of saying "we normalized the counts", I can show you EXACTLY what that was that I did.

If I can't see your code I don't trust you.

I think it's cleaner, more honest, more reproducible, and it helps teach younger researchers.

ps. Huge amounts of "human" data are normally public and available for anyone to work with, it's only specific subsets that need to be private.


was written over 45 years ago. Granger is rolling over in his grave every time someone "discovers" a magical relationship between two time-series. In all honesty, statistics is hard and it's something you need to practice on a regular basis.
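A rough way to see how easily such "relationships" appear is to simulate pairs of series that are independent by construction. The sketch below uses only the standard library; the series length, number of trials, seed, and the 0.5 cutoff are all arbitrary assumptions, and the helper names are made up for this example.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def random_walk(rng, n):
    """Cumulative sum of n independent standard normal steps."""
    position, path = 0.0, []
    for _ in range(n):
        position += rng.gauss(0.0, 1.0)
        path.append(position)
    return path

def white_noise(rng, n):
    """n independent standard normal draws, with no trend or memory."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def share_strongly_correlated(make_series, trials=500, n=200, cutoff=0.5, seed=0):
    """Fraction of independent pairs whose |correlation| exceeds the cutoff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = make_series(rng, n)
        b = make_series(rng, n)
        if abs(pearson(a, b)) > cutoff:
            hits += 1
    return hits / trials

share_walks = share_strongly_correlated(random_walk)
share_noise = share_strongly_correlated(white_noise)
print(f"independent random walks with |r| > 0.5: {share_walks:.1%}")
print(f"independent white noise  with |r| > 0.5: {share_noise:.1%}")
```

Trending, wandering series routinely look strongly correlated even when nothing connects them, while the same test on memoryless noise almost never does.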

Statistical significance is required, but not sufficient to prove an effect. Lack of statistical significance means you did not prove an effect, but you also didn't prove there is no effect.

So the answer is more likely "statistical significance and more" rather than "ditch statistical significance".


When we're talking about how to take data as implying X, what is needed is: [logical reason to believe position, how the data was chosen to not bias the whole process, etc] + [data above threshold].

The data that a scientist gets "lives" inside one or another experimental box, some area. But unless the scientist also takes into account how that box and that data came to be, the scientist cannot make any definitive statement based on the properties of just the data.

The statements "Correlation does not [automatically] imply causation" and "Extraordinary claims require extraordinary evidence" both reflect this.

It's such a weak test that while you may use it informally, it should rarely be included in publications.

I think maybe you didn't read the article. This is addressed constantly throughout it.

Or are you just summarising?

From the HN Guidelines:

"Please don't insinuate that someone hasn't read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that.""


I have always said that if an experimental medication produced a useful effect in 1 subject out of 1000, that doesn't mean it's garbage and should be dismissed at that point. It could well mean that this one person was different in the same way that 1 out of every 1000 other people is, and 0.1% of the Earth's population is 7.55 million people, who are still worth curing.

Unfortunately it's much more likely that this person happened to get better for unrelated reasons than because this medication cured them. Likely enough that it's not worth looking into unless this is a condition that effectively no one recovers from.

A few points:

- The basis of a p-value is very much aligned with the scientific process, in that you aren't trying to prove something 'is true'; rather, you're trying to prove something false. Rejection of p-values / hypothesis testing is a bit like rejecting the scientific method. I am lucky enough to be friends with one of the physicists who worked on finding the Higgs Boson, and he hammered it into my head that their work was to go out of their way to prove the Higgs Boson was a fluke - a statistical anomaly - sheer randomness. This is a very different mentality to trying to prove your new wunder-drug is effective - especially when those pesky confidence intervals get in the way of a promotion or a new grant. It's much easier to say p-values are at fault.

- Underpinning p-values is a distributional assumption: the distribution assumed when computing your p-value needs to match that of whatever process you're trying to test, or else the p-values become less meaningful.

- The 5% threshold is far too low. This means at least 5% of published papers are reporting nonsense and nothing but dumb luck (even if they got lucky with the distribution). If the distributional assumptions aren't met then it's even higher. Why are we choosing a 5% threshold for a process/drug that can have serious side-effects?

- p-value hacking. So many sneaky ways to find significance here. Taleb goes into some detail into the problem of p-values here https://www.youtube.com/watch?v=8qrfSh07rT0 and in a similar vein here https://www.youtube.com/watch?v=D6CxfBMUf1o.

Doing stats well is hard and open to wilful and naive abuse. The solution is not to misuse or throw away these tools but to understand them properly. If you're in research you should think of stats as being part of your education, not just a tickbox that is used to validate whatever experiment you're doing.
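To make the threshold and p-hacking points concrete, here is a small simulation sketch. It assumes a deliberately simple setup that is not from any real study: every effect is truly zero, data are normal with a known standard deviation of 1 (so a z-test applies), and each hypothetical "study" measures 20 independent outcomes. What it shows is the false-positive rate among true-null experiments; how that maps onto published papers depends on things it does not model.

```python
import math
import random
from statistics import NormalDist

def two_sided_p(sample):
    """Two-sided z-test p-value for H0: mean = 0, assuming known sd = 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate(studies=2000, outcomes=20, n=30, alpha=0.05, seed=1):
    """Return (rate for one pre-chosen outcome, rate for 'any outcome')."""
    rng = random.Random(seed)
    first_hits = 0  # the single outcome chosen in advance is "significant"
    any_hits = 0    # at least one of the outcomes is "significant"
    for _ in range(studies):
        p_values = [
            two_sided_p([rng.gauss(0.0, 1.0) for _ in range(n)])
            for _ in range(outcomes)
        ]
        if p_values[0] < alpha:
            first_hits += 1
        if min(p_values) < alpha:
            any_hits += 1
    return first_hits / studies, any_hits / studies

single_rate, any_rate = simulate()
print(f"pre-registered single outcome 'significant': {single_rate:.1%}")
print(f"best of 20 outcomes 'significant':           {any_rate:.1%}")
print(f"theory for best of 20: {1 - 0.95 ** 20:.1%}")
```

With one outcome chosen in advance, roughly one null study in twenty crosses the line; when the analyst is free to report whichever of 20 outcomes looks best, most null studies do.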

It definitely needs to be left out of anything with non-statisticians in the intended audience. I've started leaving it out of most reports. If I write about a difference, it's statistically significant. The test just gives me confidence to write it.

As someone who does a lot of meta-analyses I'd prefer you left in non-significant values as well, if they bear on the hypotheses at hand. Aggregating over nonsignificant effect sizes can still result in an overall effect that is significant.

This. A dozen "non-significant" studies that all have effects in the same magnitude and direction are telling you something.
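A minimal sketch of why that is, using a fixed-effect (inverse-variance) pooling of twelve studies. The effect estimates and standard errors are made-up numbers chosen for illustration, and a real meta-analysis would also need to worry about heterogeneity and publication bias.

```python
import math
from statistics import NormalDist

def two_sided_p(estimate, se):
    """Two-sided p-value for H0: effect = 0, using a normal approximation."""
    z = estimate / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def pool_fixed_effect(estimates, ses):
    """Inverse-variance weighted mean and its standard error."""
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se

# Twelve hypothetical studies: similar effects, same direction, each underpowered.
estimates = [0.18, 0.22, 0.25, 0.15, 0.20, 0.19, 0.23, 0.17, 0.21, 0.24, 0.16, 0.20]
ses = [0.15] * 12

individual_p = [two_sided_p(e, s) for e, s in zip(estimates, ses)]
pooled, pooled_se = pool_fixed_effect(estimates, ses)
pooled_p = two_sided_p(pooled, pooled_se)

print("smallest individual p-value:", round(min(individual_p), 3))
print("pooled estimate:", round(pooled, 3), "+/-", round(pooled_se, 3))
print("pooled p-value:", pooled_p)
```

No single study clears 0.05, but the pooled standard error shrinks with the number of studies, so the combined estimate is far from zero relative to its uncertainty.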

From Brad Efron in [1]: "The frequentist aims for universally acceptable conclusions, ones that will stand up to adversarial scrutiny. The FDA for example doesn’t care about Pfizer’s prior opinion of how well its new drug will work, it wants objective proof. Pfizer, on the other hand may care very much about its own opinions in planning future drug development."

Significance requirements should be approached differently depending on the use-case. The above are two extreme cases: at one extreme, the FDA authorizing a new drug, where significance guarantees should be rigorously obtained beforehand; at the other extreme, exploratory data analysis inside a private company, where data scientists may use fancy priors or unproven techniques to fish for potential discoveries in the data.

Now how much significance guarantee should be required from a lab scientist is unclear to me. Why not let lab scientists publish their lab notebook with all experiments/remarks/conjectures without any significance requirement? The current situation looks pretty much like this anyway with many papers with significance claims that are not reproducible.

We should ask the question of how much the requirement of statistical significance hinders the exploratory process of science. Maybe the current situation is fine, maybe we should have new journals for "lab notebooks" with no significance requirements, etc.

On the other hand, in the mathematical literature, wrong claims are published often, see [2] for some examples. But mathematicians do not seem to be as critical of this as the public is of non-reproducible papers in the life sciences. Wrong mathematical proofs can be fixed, and wrong proofs that can't be fixed sometimes still contain a fruitful argument that could be helpful elsewhere. More importantly, the most difficult task is to come up with what to prove; if the proof is wrong or lacks an argument it can still be pretty useful.

[1]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=

[2]: https://mathoverflow.net/questions/35468/widely-accepted-mat...

As a layman who probably didn't understand that whole article I ask:

If "statistical significance" is just sort of an empty phrase used to dismiss or prove something somewhat arbitrarily, then isn't the same person writing the same study likely to be just as arbitrary in declaring what is or isn't significant .... anyway?

There are strict rules that define when something is "statistically significant", it's not at all arbitrary. The problem is people thinking that just because something is statistically significant, it is automatically true. Which it isn't. Statistical significance by definition includes the possibility that something was just a statistical oddity. This article essentially just reminds people of that, and urges them to abandon the "statistical significance is the same as ultimate truth" conclusion.

I feel like this misses another important prong of the article, which is: that failing to find a "statistically significant" correlation (according to some given significance test) is often mistakenly interpreted even by scientists themselves as meaning "We have a good basis to conclude that there is no such correlation".

There aren't, is the thing. There are widely used standards, but they don't actually have any real basis. Fisher, I believe, is the one who popularized it, but it was for specific circumstances and he acknowledged that it was just a convenient thing.

Sure there are. You can't just call a result "significant" at will. You can pull numbers out of thin air, pre-filter your data or carefully pick a statistical test to be in your favor. But it's still well-defined which outcomes you're allowed to call significant and which ones you can't.

Thank you.

Statistical significance is a simple idea, it's nothing more than the realization that if you flip a thousand coins, getting a thousand heads means they probably have heads on both sides or something, while getting fifty-fifty doesn't mean much. On the other hand, if you flip one coin and get 100% heads, you haven't proven much of anything either. Statistical significance is just one way to put a number on how likely it is that you're reading tea leaves.
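The coin example can be put into numbers with an exact binomial tail. This is only a sketch: `p_at_least` is a helper written for this comment, and it computes the one-sided chance of seeing at least that many heads if the coin were fair.

```python
from math import comb

def p_at_least(heads, flips, p_heads=0.5):
    """P(X >= heads) for X ~ Binomial(flips, p_heads)."""
    return sum(
        comb(flips, k) * p_heads ** k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )

p_one_flip = p_at_least(1, 1)            # one flip, one head
p_fifty_fifty = p_at_least(500, 1000)    # a thousand flips, half heads
p_all_heads = p_at_least(1000, 1000)     # a thousand flips, all heads

print("1 head in 1 flip:          ", p_one_flip)
print("500+ heads in 1000 flips:  ", p_fifty_fifty)
print("1000 heads in 1000 flips:  ", p_all_heads)
```

A single head and a fifty-fifty split are both entirely ordinary for a fair coin, while a thousand heads in a row has a probability so small that "fair coin" stops being a believable explanation.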

It can't stop you from lying, cheating, or stealing. Nobody has ever not been able to commit fraud because of p-values. There is no formula in the world that can notice when you are cheating in what numbers you plug in...

However, if you know what you're doing, and you're honest, it's really important to know whether you are seeing shapes in clouds or actual patterns. That's what statistical significance is about.

Thank you.

the idea of the p-value is a useful and reasonable one. it's a good statistical tool that can help you tell whether your results are due to something real in the world or just dumb luck.

the problem is people leaning on it too heavily and especially obsessing over the 0.05 threshold: so that p=0.049 means your result is statistically significant, you get a paper published and a press release and tenure, while p=0.051 means failure and penury.

the article is arguing specifically against this latter practice, of "bucketing" things according to some cutoff which was literally made up arbitrarily by Ronald Fisher.

Thank you.

I feel like this is one area where clickbait media has pushed things backwards. Everyone wants the clicks, so facts from studies get skewed into binary results when it's almost always shades of gray. If I see a study and it shows that you may be slightly less likely to get Alzheimer's if you drink green tea every day, but only on the order of half a percent or so, I don't have a magic cure-all for Alzheimer's. But you will see news headlines like "Green Tea cures Alzheimer's! and may even be effective for ED!" Maybe we shouldn't rise against statistical significance and should instead push back on incorrect dissemination of the results?

It's interesting that they talk about a category error around "no association". In fact there is a category error in applying statistical thinking in cases where objects are not comparable - like human metabolisms, ecosystems, art...

Most arguments used in this discussion are much better exposed in the book


The article spends too much time saying what not to do and not enough time saying what to do instead. People treating p-values dichotomously are doing so because they think that's what they're supposed to be doing. So while it's amusing to rant as in this article, it should have devoted its efforts to a presentation of exactly how a paper should be written in this new age. I suppose there's plenty of opportunity for others to do that, but this seems high profile.

I would never guess that people would read:

"I measured no significant difference."

to mean:

"There's no difference."

and not:

"I couldn't measure precisely enough to see what difference there is, if any."

So we have 800 scientists signing a paper, but there are on the order of 7 million scientists worldwide. To prove the hypothesis "scientists rise up against statistical significance" with 95% confidence level and a .1 confidence interval, we need a sample size of 857,462. A sample size of just 800 is clearly not statistically significant, so the paper is meaningless and the hypothesis can be rejected. Am I doing this right?

What you want is a one-tailed Binomial test:


You need these ingredients:

- The scientists polled for whether they're rising up need to be randomly sampled from the population of 7M

- A threshold for the fraction of scientists rising up to consider an uprising to be occurring (e.g. 90% of scientists rising up indicates an uprising)

Then you can poll your scientists and feed the results into the Binomial test to find out whether there is statistical evidence to support the hypothesis that an uprising is occurring.

Note that the sample size to do this might be quite small. In general, the required sample size is not dependent on the size of the population, but rather on how close the actual fraction of scientists rising up is to your chosen threshold. If the true fraction of scientists rising up is much smaller than your threshold, say 10%, and your threshold is 90%, it might only take a random sample of 10-20 scientists to be confident there's no uprising. But if the true fraction is 89.99999% it would take a huge sample to be confident that no uprising is occurring.
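Here is a rough sketch of that dependence, using an exact lower-tail binomial probability with only the standard library. It makes a simplifying assumption that is mine, not the parent's: the sample is taken to show exactly the true fraction (rounded down), and we look for the smallest sample at which that outcome would be surprising at the 5% level if the threshold fraction were really rising up.

```python
import math
from math import comb

def lower_tail_p(k, n, threshold):
    """P(X <= k) for X ~ Binomial(n, threshold)."""
    return sum(
        comb(n, i) * threshold ** i * (1 - threshold) ** (n - i)
        for i in range(k + 1)
    )

def smallest_decisive_sample(true_fraction, threshold=0.9, alpha=0.05, max_n=2000):
    """Smallest n at which a typical sample rejects 'fraction >= threshold'."""
    for n in range(1, max_n + 1):
        # Assume the sample shows the true fraction, rounded down to a whole count.
        k = math.floor(true_fraction * n + 1e-9)
        if lower_tail_p(k, n, threshold) < alpha:
            return n
    return None

n_far = smallest_decisive_sample(0.10)   # true fraction far below the threshold
n_near = smallest_decisive_sample(0.85)  # true fraction just below the threshold

print("sample needed when 10% are rising up:", n_far)
print("sample needed when 85% are rising up:", n_near)
```

The population size of 7M never enters the calculation; only the gap between the true fraction and the threshold drives how many scientists you need to poll.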

> Am I doing this right?

Sampling error is independent of population size, so, no.
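A quick sketch of what that looks like numerically, for a simple random sample and a yes/no question. The 1.96 multiplier (95% level), the worst-case p = 0.5, and the comparison populations are illustrative assumptions; the optional finite population correction is included to show how little it changes things at these sizes.

```python
import math

def margin_of_error(n, population=None, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample."""
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction; close to 1 when n is a small share of N.
        se *= math.sqrt((population - n) / (population - 1))
    return z * se

moe_infinite = margin_of_error(800)
moe_7m = margin_of_error(800, population=7_000_000)
moe_70k = margin_of_error(800, population=70_000)

print(f"n=800, infinite population: +/-{moe_infinite:.4%}")
print(f"n=800, population 7M:       +/-{moe_7m:.4%}")
print(f"n=800, population 70k:      +/-{moe_70k:.4%}")
```

Strictly speaking the correction makes the error depend on population size a little, but for 800 out of millions the difference is far below anything you would report.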

Also, the forces influencing the society are not statistical, at least not in this sense. A small minority in a group is definitely capable of challenging old norms and installing new ones. In fact, that is how it usually happens.

Something I find helpful when doing statistics is to clearly state exactly what you’re trying to show or calculate before you calculate anything. “with 95% confidence level and a .1 confidence interval” is not clearly stated.

Here’s something you might credibly calculate: if you ask n scientists a yes/no question, but actually only half of scientists worldwide think “yes”, then what’s the chance that m or more of them in your sample say “yes”? The answer is the p value, and it’s not very hard to calculate.

The point of this article is that many, many papers do this calculation and ascribe totally inappropriate meaning to it. And, of course, when you clearly state what you actually calculated, it’s pretty obvious that it’s a terrible way to draw conclusions about what fraction of scientists think “yes”. This is the point of the Nature paper.

P.S. I don’t know what you’re trying to calculate, but I think you did it wrong. If you want your calculation checked, can you clarify your question?

It was meant to be tongue firmly in cheek. Poe's law strikes again...

To say nothing of the selection bias at work here.

So we have 800 scientists signing a paper, but there are on the order of 7 million scientists worldwide.

Hence rising up against statistical significance.

I think you provide a convincing case of the problems of statistical significance:

Without context on the 800 scientists that signed, we don’t know whether it’s statistically significant or not. 800 Nobel prize winners would certainly mean more than 800 graduate students. The discipline they were in could matter. I didn’t check your math, and I’m sure it’s fine. Raw statistics cannot reject the hypothesis (or null).

Huh? This has nothing to do with statistical significance.

Also, it sounds like the authors didn’t sample scientists at all. They solicited signatures with some minimal qualifications. The 800 number wasn’t claimed to prove anything.

Ever hear of the statistician who drowned in a river an average of 1 foot deep?

It's worth emphasizing that the concept of statistical significance is still super useful. It just doesn't deserve to be as central to the scientific process as it currently is.

But it should be central to scientific process. If you do any kind of experiment and gather the results, then the first thing you need to do is to make sure that your results are real, and not just some random noise in data. Otherwise you're not doing science anymore.

...but does statistical significance really allow you to distinguish real results from random noise? I think it's pretty clear that it isn't a good tool for achieving this goal, due to the "p hacking" phenomenon https://en.wikipedia.org/wiki/Data_dredging

Specifically, the issue is with null hypothesis significance testing. That's what this whole "statistical significance" debate is about, because "Scientists rise up against NHST" isn't clickbaity enough :(. Yes, statistics are central, and we have to answer the question you posed, but we don't have to use NHST to do it.

First, there's a lot of science that's done without "experiments".

Second, the problem with significance is it's pinning "not just some random noise" to an arbitrary threshold. Results with a p-value of 0.051 and those with a p-value of 0.049 are effectively the same, and yet are treated as if they're worlds apart.
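To put a number on "effectively the same", here is a small sketch that converts the two p-values back into the test statistics that would produce them. It assumes a two-sided z-test; other tests would give different but similarly close values.

```python
from statistics import NormalDist

def z_for_two_sided_p(p):
    """The |z| statistic that yields a given two-sided p-value."""
    return NormalDist().inv_cdf(1 - p / 2)

z_049 = z_for_two_sided_p(0.049)
z_051 = z_for_two_sided_p(0.051)

print(f"p = 0.049 corresponds to |z| = {z_049:.4f}")
print(f"p = 0.051 corresponds to |z| = {z_051:.4f}")
print(f"difference in |z|: {z_049 - z_051:.4f}")
```

The two results differ by less than two hundredths of a standard error, which is well within what a slightly different sample would shift.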

I for one flagged this kind of crap in one of my recent peer reviews. It is a transgression that is all the worse when their sample size was woefully small.

I must be stupid. It feels like a Monty Python word game designed to confuse. There's no way I can not disbelieve a statistically insignificant study outcome unless I want to allow myself to believe in some (but which ones?) statistically insignificant ones.

The key is that there is a distinction between "not statistically significant" and "statistically insignificant".

I see a lot of health-related signatures on the list. Is this a backlash because of the findings against homeopathy?

No. I'm one of the signatories, and an epidemiologist.

First, the authors are well-known in the area. Sander Greenland is a major figure in epidemiology methods, etc.

Second, there's been a push there, historically, to get rid of p-values because the field is primarily more focused on effect estimation, rather than pure hypothesis testing, so p-values are particularly poorly suited.

I don't think so no. I think in health there is just a lot of pressure to come to an answer (does this medicine hurt or help) and to have a justification for that lawsuit for regulators and attorneys.

which findings?

Reviews like this ...


that have led to various national health systems refusing to pay for homeopathic treatment.

The studies are almost universally p-value based.

How many of them rose up?

My general rule of thumb is if I'm having a debate about statistical significance, I'm debating the wrong thing, should stop talking, and get more data. Preferably so much more data that the question answers itself without having to test for significance.

This doesn't work if you need to expose patients to a novel drug to gather data.

The era of medicine where we administer a 100 mg pill of a small molecule to 3000 patients, calculate a P value and release it for sale is dead. There's a huge industry, so it's not aware of its death yet, but the research world has moved on. I think the paradigm will be based around deep understanding of the genomics, proteomics, histology, spatial distribution of the problem in the body, and licensing for sale platforms that produce custom proteins, T cells, small molecules, <intervention of choice>, designed on the fly. It's 30 years out, but if you're betting the bank on a small molecule in a phase 3 trial, that model is only going to hold up for so long.

I think I will forever be bitter about an argument that I had with a professor in college about statistical significance. There was a study with 12 (or some other tiny N) people who were told to click a button when they saw some change as a laser was moving in a circle. They were then asked to go back and pick the place where the laser was when they observed the change. The study found that people remembered clicking the button 10ms (or some other tiny value) before they actually did. This was clearly grounds for all of us to question whether humans had free will at all, because the result was statistically significant after all! When I challenged the professor on this, I was told that I should take a statistics class. I think that professor still turns me off from philosophy to this day. This happened in a philosophy class

Your description almost matches an experiment done in EEG research called the Libet experiment [1]. Although it's a bit different from how you describe it, I'm confident this is what you are referring to.

They find that when you perform an action, there is an EEG spike [2] in the motor cortex well before you actually consciously decide to perform the action. The experiment is conducted with a dot running around a circle and the subject has to tell when (as per where the dot was) he decided to act. The EEG potential is seen prior to that decision moment.

This is related to free-will as it is as if the decision of acting is not coming from your conscious self but from a deeper layer.

[1] Libet experiment: https://www.youtube.com/watch?v=OjCt-L0Ph5o

[2] Bereitschaftspotential: https://en.wikipedia.org/wiki/Bereitschaftspotential

I remember hearing psych-major classmates discussing that experiment in the dorms after class. It's kinda mind-blowing when you think about it.

I read (in a book sitting in a science museum giftshop, so take with the appropriate grain of salt) that most of Freud's theories have been shown to be bullshit in the century and a half since he lived, but his enduring influence was showing that the vast majority of human behavior is unconscious. Instead of thinking of your consciousness as the primary actor that determines what you do, you have to think of it as a vague overseer that occasionally notices the body doing something and can intervene with enough time, exposure, and practice. And whole industries have been built upon that principle - advertising, mass media, propaganda, behavioral finance, therapy, coaching, education, gaming, gambling, cigarettes, coffee, travel, social media.

There was a HN thread 2 days ago about the rise and fall of scientific authority and how to bring it back. I cynically commented that the rise of the physical sciences came from the ability to win wars with them. You could look at the rise of the psychological and social sciences as coming from the ability to make money with them.

Wars are fought and won with psychology every day. Only fairly trivial “big army” wars are possible today.

Though Freud is often credited with coming up with the concept of the unconscious, it was actually quite well known before him.

Perhaps I'm missing the point here but why does this say anything about free will? Of course you can't make split second conscious decisions. But you've still made a conscious decision to prime your subconscious faculties to act in a certain way.

The consciousness is super slow. It doesn't make sense to have it "do" anything. But it's good for making executive decisions. The CEO doesn't make the product.

It meshes with certain other results, for example on timing of predictions (http://www.kurzweilai.net/more-evidence-that-youre-a-mindles...), and most interestingly on split-brain patients (http://www.powerofstories.com/our-brains-constantly-confabul...). In the latter case, split-brain people would make choices based on information only available to their right brain, then when asked to explain them would unconsciously invent an explanation for that choice which was based only on information from their left brain.

I'm not sure philosophically whether our inability to understand our decisions undermines our free will, but it certainly undermines any ability to consciously prime ourselves to make certain decisions - hard to have that feedback loop when you don't even know what decision you made!

Is saying it "meshes with specific other results" just a signal of cognitive bias?

I personally would confidently guess there are unconscious faculties in the mind, but I don't see how this experiment is remarkable in proving this, see here(1). Is it not an equally likely conclusion that the brain takes a few hundredths of a second to develop a conscious decision? Actually, the inverse of that is what would be remarkable.


The issue isn't whether a decision is unconscious or conscious. It's that we often think that we've made a conscious decision when the decision process we narrate for ourselves and others is provably impossible.

I should have phrased the original as "it is a relatively weak example of this specific set of experimental results".

That's pretty much my line of thought. If I decide to bicycle somewhere, I'm certainly not consciously thinking through most of the act of actually cycling, but I have no doubt that the "uuugh, this is going to suck but I need the exercise" decision happened on a conscious level.

> They find that when you perform an action, there is a EEG spike [2] in the motor cortex well before you actually consciously decide to perform the action.

No, that's not what they find. What they find is that the time of the change in EEG (it's not really a "spike", it's more like the leading edge of an increased action potential that lasts for a significant time) is a few tenths of a second before the time that the subject reports as the time they "made the decision". But you can't assume that the time the subject reports is "the time they made the decision", because the process of generating the conclusion "this is when I decided to act" also takes time--and that time was not measured. All you can really conclude from this experiment is that people are not consciously aware of all of the neural processes that actually go into their making a conscious decision.

Actually, this explains a feeling I've been having lately. In various occasions where I have to randomly pick between some things, I choose X but I can somehow feel that the actual choice happened a tiny fraction of a second before I was actually aware of it (kind of a contradiction, but I can't explain it better).

Basically, it feels like the decision was made a tiny bit earlier than "I" made it.

Why isn't the explanation that the reporting function is slow? Your "free will" -which we won't explain- decides to fire the motor neurons, and then does the reporting, which is pretty unnatural compared to moving a muscle, to do the thinking about where the dot was.

Or maybe you overcompensate when calculating where you thought the dot was.

The video addresses that. The experiments were constructed not so that you report "I decided to press the button", but rather by reporting "The state of the world [as measured by this rotating dot on the screen] was this at the time I decided to press the button." And he found that the state of the world that people reported when they pressed the button was consistently 200 ms before actually pressing the button (which is in accord with other experiments measuring human reaction time), but the EEG spike in the motor cortex was 350 ms before the subject's perception of reality at the time they made their decision, and 550 ms before pressing the button.

Most people think of their consciousness as "That part of my mind which integrates all my sensory information into a coherent experience of reality at this moment, and then decides what to do within that reality." You could redefine free will as the EEG potentials themselves operating within your brain - but that's not how most people, subjectively, experience it, particularly because they are not generally aware of most of these EEG potentials directly.

I'm not going to pretend like I'm sat here on my couch and know more than those scientists, so can someone help me understand how they generated a free will hypothesis about this? Seems to me I could explain this with something like "short term memory assignment is delayed," i.e. the decision was made by the conscious brain, but the memory of making the decision had a long write time and therefore a later "timestamp."

I would submit this other similar experiment from YouTuber Vsauce [1] where a machine is trained to react to his subconscious thoughts before he consciously chooses to press a button.

[1] https://www.youtube.com/watch?v=lmI7NnMqwLQ Note: full video is behind YT Premium but the preview is enough to understand the experiment.

Isn't that easily explained by the time it takes to decide to speak after deciding to act?

Coming from a deeper layer does not mean it's not freewill.

I'm sorry an incompetent professor left such a bad taste in your mouth, and on the field as a whole.

What's strange to me is that my interpretation of the results of such an experiment wouldn't even lead to your professor's conclusion. The takeaway would be the fallibility of sensory perception, where I might then prompt the class for a discussion of their intuitive refutations of empiricism before diving into the literature.

Unfortunately, being a philosophy major myself, I know all too well that a crap teacher can totally ruin a philosophy topic (let alone a topic of any subject). From my 4 years in philosophy classes of varying levels of difficulty, the common denominator between a fruitful time spent in class has been the willingness of the professor to engage with their students. Whether it's logic, metaphysics, epistemology, ontology, &c, the principal property of a quality professor is his/her dialectical ability.

Hell, that's how philosophy & theology was taught in the first universities! The professor would profess and then the students would engage their master in the subject at hand.





1. claim that one has (a quality or feeling), especially when this is not the case.

"he had professed his love for her only to walk away" synonyms: declare, announce, proclaim, assert, state, affirm, avow, maintain, protest, aver, vow;

2. affirm one's faith in or allegiance to (a religion or set of beliefs).

"a people professing Christianity" synonyms: state/affirm one's faith in, affirm one's allegiance to, make a public declaration of, declare publicly, avow, confess, acknowledge publicly "in 325 the Emperor himself professed Christianity"

Brief etymology of "professor":

From Latin "profiteri", to the form "profess-" meaning "declared publicly", and to "professor", then to Late Middle English as "professor".

So a professor's practice is probably closer to definition 2: "make a public declaration of" whatever one's skill or knowledge of a particular art might be.

Genuine question. What about the professor's statements were inaccurate or incompetent? Is the sample size really too small? Is the claimed conclusion about free will invalid? Or is the criticism just the dismissive tone toward the student?

The incompetence originates from the disregard of the parent commenter's question/concern. It's the result of not engaging in good faith with your student, not necessarily the conclusions drawn. As I mentioned in my original comment, the value in taking a philosophy class (especially as a student in a different field) is the chance to engage with both the professor and your peers; it serves as a veritable petri dish for developing one's ability to succinctly articulate and debate topics. If you're expected to sit in a philosophy class and just absorb the material without any contrary thought, something is seriously awry. It goes against the very nature of why humans pursued philosophy in the first place.

Furthermore, it seems strange for a professor of philosophy to so easily dismiss criticism out of hand. Of all subjects, a philosophy professor has a pedagogical imperative to entertain contradictory positions and explain why or why not one ought to follow a line of reasoning. In addition, the question about the merit of a small sample size could itself serve as a valuable aside in teaching fundamental notions in the philosophy of science.

Note: This is from the perspective of Western analytic philosophy, but the spirit of debate and discussion is no less integral to the continental tradition.

This reminds me of a professor of mine who jokingly says that he is no longer teaching, but preaching.

I mean N = 12 is incredibly small to make such a sweeping statement about all of humanity but further it implicitly accepts that 1. free will is demonstrable via the experiment, 2. the reaction wasn’t preempted by free thought leading to the decision, and otherwise, 3. you’re a bad philosophy teacher if you’re trying to prove philosophy with statistics, imo.

This is a well-known experiment that has deep implications (https://en.wikipedia.org/wiki/Neuroscience_of_free_will; see the Libet experiment, among others). It doesn't question free will directly. Making decisions without being consciously aware of them doesn't contradict free will.

Personally, I side with Dennett (as always).

Also, many studies have n=12 and still have statistical power.
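For anyone who wants to sanity-check the n=12 point, here is a rough simulation sketch. It is not from the study being discussed: it assumes a one-sample, two-sided t-test at the 5% level on normally distributed data, and the effect sizes (in standard deviations) are made up for illustration. The function name and parameters are hypothetical.

```python
import random
import statistics

N = 12               # sample size under discussion
T_CRIT_DF11 = 2.201  # two-sided 5% critical value of Student's t, 11 degrees of freedom


def simulated_power(effect_size, trials=4000, seed=0):
    """Fraction of simulated one-sample t-tests (n=12) that reject a true mean of zero.

    effect_size is the assumed true mean in units of the standard deviation.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(effect_size, 1.0) for _ in range(N)]
        standard_error = statistics.stdev(sample) / N ** 0.5
        t_statistic = statistics.mean(sample) / standard_error
        if abs(t_statistic) > T_CRIT_DF11:
            rejections += 1
    return rejections / trials


large_effect_power = simulated_power(1.0)  # assumed large effect: one standard deviation
small_effect_power = simulated_power(0.2)  # assumed small effect: a fifth of a standard deviation
print(large_effect_power, small_effect_power)
```

Under these assumptions, twelve subjects give decent power only when the effect is large relative to the noise; for a small effect, most runs fail to reject. Whether n=12 is "enough" therefore depends on the effect size, not on the number alone.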

I imagine that it takes longer for your vision system to generate a useful output than whatever is responsible for telling you that you just stubbed your toe.

If signals like touch, smell, sight, etc. arrived at the input to your perceptual system immediately, you'd get a bunch of inputs for the same event at different times, and it would probably be difficult for your perceptual system to make sense of them.

I think the only thing the spinning wheel experiment shows is that the brain has some sort of perceptual delay/compensation mechanism that's probably there to account for these differences in input processing times. And it probably backdates the "timestamp" of the event so that things that rely on short time intervals (e.g. control tasks like balancing) still work reliably.

I don't know why anybody would think this says anything about free will. Philosophers are weird.

> I don't know why anybody would think this says anything about free will.

If some decisions that we think are made consciously are actually unconscious, then how do we know that any decisions we make are really conscious decisions?

What if our consciousness makes no decisions, and is just a figment of our imagination?

Hm. I used to do competitive air rifle shooting (fwiw top five in the nation at the time), and this experiment comes close to the sort of training that I would do with my coach. He would act as a spotter and, after every single shot, would ask me where my shot was (e.g., 9pt at 4 o'clock). We did this probably 300 times in the course of my training. Despite the "jitter", it was something that could eventually be done with surprising consistency after a while of training this way. It would be a completely different matter if the training was done on new shooters (who are never, ever good when they first start). The training was also completely ineffective if I had caffeine, or wasn't in the right mood. My guessing ability would then effectively match what was seen in this test.

I'm not sure how this test would have proven anything, especially with such a small sample size. As another poster mentioned, this very much sounds like a test for sensory motor latency, which is absolutely a nontrivial thing for this sort of test.

If the test you described was done against trained shooters, I'd imagine the conclusion would likely be the opposite.

How is this experiment related to free will at all? Sounds to me like something about sensory/motor latency.

Yeah I would've done the one where you have everyone in the room close their eyes and think of a "random" number, write it down, then compare the distribution.

Results like those are what make me question free will (My mind "weighted" my answers in ways I can't control).

How does being a bad random number generator mean that we don't have free will? I don't see these two as contradictory.

These are good examples of the fact that our common intuitions about free will are disconnected from reality, often silly, and usually plain wrong.

For example, it's common to have the intuition that the choice when picking a number from 1 to 10 is completely unbiased and independent. That would imply a perfectly uniform distribution.

Of course that doesn't match reality, and it raises more questions for the intellectually honest thinker, but most people don't think about it beyond the simple intuition.
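To make the "perfectly uniform distribution" intuition testable, one option is a Pearson chi-square check against uniform. A minimal sketch follows; the tallies are invented for illustration (not data from any class), and the helper name is hypothetical.

```python
def chi_square_uniform(counts):
    """Pearson chi-square statistic for observed counts against a uniform expectation."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((observed - expected) ** 2 / expected for observed in counts)


# Hypothetical tallies for picks of 1..10 from 100 people (made up for illustration;
# the spike at index 6 stands in for an over-picked "lucky" 7).
class_picks = [3, 6, 14, 8, 7, 5, 31, 12, 9, 5]

CRITICAL_5PCT_DF9 = 16.919  # chi-square 5% critical value for 9 degrees of freedom

statistic = chi_square_uniform(class_picks)
# True means the tallies are hard to reconcile with unbiased, independent picks.
print(statistic > CRITICAL_5PCT_DF9)
```

With ten categories there are nine degrees of freedom, so a statistic well above about 16.9 would be surprising if people really picked uniformly.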

Yes, but that's just some humans being uneducated about how the human brain works. It doesn't really say anything about the existence of free will. In fact, the reason untrained humans are bad random number generators has little to do with the abilities of the human brain and more to do with how random distributions work.

Untrained people just don't know, for example, that long streaks of heads and tails are actually quite likely in a sequence of random coin tosses. If they did know the likelihood, and they were tasked with writing down a sequence of random coin tosses, they would probably do a much better job.
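The streak claim is easy to check exactly. Here is a small sketch (the function name is my own) that computes the probability that a sequence of fair coin tosses contains a run of at least k identical outcomes, by tracking the length of the current run.

```python
def prob_streak_at_least(n_tosses, k):
    """Exact probability that n fair coin tosses contain a run of k or more identical outcomes."""
    if k <= 1:
        return 1.0 if n_tosses >= 1 else 0.0
    if n_tosses < k:
        return 0.0
    # alive[r]: probability that no run of length k has occurred so far
    # and the current run has length r (1 <= r < k).
    alive = [0.0] * k
    alive[1] = 1.0  # after the first toss the run length is 1
    for _ in range(n_tosses - 1):
        nxt = [0.0] * k
        for r in range(1, k):
            p = alive[r]
            nxt[1] += p / 2          # outcome flips: run restarts at length 1
            if r + 1 < k:
                nxt[r + 1] += p / 2  # outcome repeats: run grows
            # otherwise the run reaches k and this mass leaves the "alive" set
        alive = nxt
    return 1.0 - sum(alive)


print(prob_streak_at_least(100, 5))
```

For 100 tosses, a run of five or more heads or tails is very likely, which is exactly what hand-written "random" sequences tend to avoid.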

If you take this to the limit, a human could learn to compute a pseudorandom generator in their head (at least in theory, although it may be very slow going) or perhaps figure out an effective way to gather entropy from the environment and turn it into a truly random sequence.

At least when I did it in class we had lots of certain numbers, essentially none of certain numbers. A lot of them lined up with common "lucky" or "unlucky" numbers, even if everyone said they didn't pick it for that reason.

To me that shows us how outside forces influence our thoughts. Did I want to buy this garbage bag because it's a good value, has a good design, or because I've heard "DONT GET MAD, GET GLAD!!" 1000 times? Every thought we have is affected by our experiences, and many experiences can be controlled - they call these advertisements.

I find this topic fascinating. I suffer from depression, and my thoughts during a major depressive episode can be the stark opposite from when I feel well (Let's say it's about suicide). How can I truly have "free will" if my decisions are so dependent on external influences?

> How can I truly have "free will" if my decisions are so dependent on external influences?

You don't actually have free will, if by free you mean "uninfluenced by anything".

There's no evidence that such a thing exists, and anyone who claims such a thing invariably appeals to supernatural or hand-wavy explanations.

Note that many thinkers who claim free will exists are actually saying "what most people mean by 'free will' does exist", and usually that means "an agent capable of making choices without outside influence". This idea is compatible with determinism, so these people are called compatibilists.

How is being an agent capable of making choices without outside influence compatible with determinism? My understanding is that nobody can ever be an agent capable of making choices without outside influence. A person is fundamentally influenced from birth by external forces, which mould the person through a deterministic process without the person having any real control. Furthermore, one could only say they had free will if, before life, one agreed to take on the role of being born as a specific person, knowing how life would play out till the end. Compatibilists are people who don't understand determinism, because free will doesn't work with determinism. /hard determinist.

I said "what most people mean by 'free will'"

For most people, "outside influence" means, "at this point in time, only the contents of my brain affect the choice I make" and "no force or agent outside myself is compelling me to make a different choice than the one I'd make otherwise".

It's a word game of course. Even if you try to be extremely precise with your definitions, to avoid this kind of thing, most people won't quite follow.

Compatibilists believe what you believe. They just agree to use the popular, vague definition of free will instead of the strict one you're using.

> How can I truly have "free will" if my decisions are so dependent on external influences?

Because you can recognize this as a pattern, make a supposition that this is related to previous conditioning, and retrain yourself at picking numbers randomly to eliminate the previous bias that you yourself recognized as 'problematic' here.

That's fated to happen by the external forces making it happen.

It's funny because statisticians are among the loudest critics of how easily statistics can be abused. I encourage you not to be turned away from a whole field because of one bad representative. Professors are humans too and say a lot of dumb things. Very few are their field incarnate.

STATISTICIANS actually do fine, generally. The problem is that no statisticians were involved in the analysis and write-up of almost all research that isn't statistics papers.

The statistics just don't mean what most scientists think/write they mean. They're using 'em wrong, in some of the ways the OP explains in fairly technical language.

Statisticians are doing fine indeed. The problem lies entirely in the (non-)design of experiments (DoE) by the “scientists”. The true source of the problem is the sorry state of university graduates' ability to properly design experiments: proper randomization and attention to sources of variability, between groups and within groups, are neither understood nor accounted for in the design. Scientists should always consult a statistician when undertaking an experiment, at least if they would like to draw sensible conclusions.

Yes, I was just saying along with OP that a statistician would be most critical of the professor's reasoning. It's like a tyrant saying you should take a class in morals when you raise an objection to their atrocities.

What do Bayesians say these days btw?

I still find Bayes to be more grounded and less “pie in the sky” than frequentists.

It all depends on your priors what they say. :)

But seriously, you can have the most pathological prior distribution, so you can then stick with it forever. (Let's say your prior predestines you to always find that whatever the new piece of data you get is so unlikely that it's more likely that it's an error/conspiracy than a real piece of data that you have to do belief update on.)

So, instead of coming up with a significance level, you have to estimate the chance of observing a null result, which determines how much new data moves your posterior distribution. The "advantage" of the Bayesian approach is that, in theory, you can incorporate every tiny little bit of data into your model (distribution). The disadvantage is that it's very susceptible to various biases (through a biased prior).
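A toy illustration of the biased-prior point, as a sketch only: it assumes the simplest conjugate setup (a Beta prior on a success rate with binomial data), and the prior parameters and the data are invented. The function name is hypothetical.

```python
def posterior_mean(prior_a, prior_b, successes, failures):
    """Posterior mean of a success rate under a Beta(prior_a, prior_b) prior.

    With binomial data the posterior is Beta(prior_a + successes, prior_b + failures).
    """
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)


# Hypothetical data: 10 successes in 100 trials (observed rate 0.10).
successes, failures = 10, 90

flat_prior = posterior_mean(1, 1, successes, failures)               # Beta(1, 1): no strong opinion
dogmatic_prior = posterior_mean(10_000, 100, successes, failures)    # prior mass piled up near 0.99

print(flat_prior, dogmatic_prior)
```

The same 100 observations pull the flat prior close to the observed rate but barely move the dogmatic one. A prior that puts exactly zero probability on a value is the limiting case: no amount of data can ever move mass onto it.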

Bayesian statistics are mathematically perfect, as long as your priors are true. Unfortunately, they are not.

I think it’s the opposite. Your confidence intervals are only true if your distribution is true. But how can it be that perfect? At least Bayesians adjust their knowledge like scientists.

Dave Rubin recently had an interesting interview with Onkar Ghate, an objectivist philosopher, about free will. You might enjoy it and it might rekindle your interest in philosophy: https://youtube.com/watch?v=rvush0oW-cw

IIRC, this interview with neuroscientist Dale Stevens dives into the claims (similar to your example) from the past few years where scientists try to disprove free will: https://youtube.com/watch?v=X6VtwHpZ1BM

Maybe this professor knows it but has found themselves stuck for the rest of their life in this career.

The people with free will refused to do the experiment.

Reminds me of my last Google interview where, upon not being able to conjure up the complete solution to a puzzle solver in 25 minutes, I asked "well I thought the interview would be on actually implementing data structures & algorithms as opposed to using DS&A to solve puzzle challenges, do you have any recommendations on how to get better at these puzzle challenges?", to which the interviewer replied "yeah definitely study data structures and algorithms." .... like riiight, thanks for that useful nugget of information - I'll be sure to ask for a refund for all that college tuition since I've obviously never ever heard of those data structures and algorithms since I couldn't solve a problem in an arbitrary time span. Sorry, not trying to diverge from the topic at hand, but it was the same kind of ridiculous response that is so disconnected from the frame of reference of the question that I couldn't help but share.

I should have had him solve a captcha for me, still not convinced that it wasn't just a reverse turing test.

I can't stand when people give flippant responses to fair questions - I can totally imagine your frustration in your case.

They might not even know the answer themselves and have a solution in front of them that they are comparing yours to.

Then they just said the first thing that came to their mind rather than that they don't know.

Wow. The issue discussed in the beginning of the article is really basic (evidence of absence vs absence of evidence). There are much more intellectually challenging issues with statistical significance. If scientists don't understand this one, it's a really sad sign.

I'm not sure how well a difference in nomenclature can fix such serious misunderstandings, but I do like the "compatibility" suggestions and the way they talk about the point estimate and endpoints of the confidence interval.
