What's the significance of 0.05 significance? (2013) (p-value.info)
68 points by xtacy 11 months ago | 32 comments



The problem with all of this is that the reducto-scientific paradigm for understanding science has long been extended to communicating and teaching about science.

As the author does a really nice job explaining, and as seen in the Fisher quote, these articulations of heuristics and guidelines are often taken as closed-form rules.

Look beyond statistics and you see it everywhere. Common ones include entrepreneurship and design...in both areas, experts' ways of thinking are often highly situated, highly metacognitive, and the actions they take are inherently inseparable from their thinking process. However, because of the academic drive towards objective/deterministic/observable phenomena, the research tends to report and attribute only the actions. The result is that those actions, rather than the underlying thinking processes, are valued and taught.

The result is simulations of expertise masquerading as knowledge. It's one thing when it's students, but as you are seeing in psychology's 'replication crisis' (which, side note, is kind of a meta-version of its own critique), it can create real problems when surface-level understanding is accepted and generalized as a normative 'truth' in a field. You see it in economics and business a lot...strive to appear scientific, but do so in ways that inherently betray the underlying structure of what you are studying. It comes from an underlying value in those communities, and society, that the only truth is objective truth.

If I have an experiment where I am screening 5 possible predictors and I get p-values of .9 for 4 of them and .52 for 1...I would be an idiot not to pursue the 1. If I get 4 .49s and 1 .00000000001...same thing. Statistics is relative, literally.

[happy to provide citations...not sure anyone really cares]


I care, and agree with your points. It seems at least part of the fetishization of p < 0.05 has arisen from its common use as the cutoff point in drug clinical trials, where it thus represents a make-or-break point in a multi-billion dollar industry. There needs to be some predefined standard for such trials to prevent cherry-picking and other games, but as the blog describes, it could just as easily have been a different threshold. Similar observations can also be made about the values used to set study sample size (based on a somewhat arbitrary alpha and beta).

The arbitrariness at the heart of the regulatory enterprise may seem disconcerting, but the alternative (no shared standards for evaluating drug efficacy) has also been tried historically, and the result was markets flooded with useless, often dangerous, products many of which nonetheless sold very well.


> It seems at least part of the fetishization of p < 0.05 has arisen from its common use as the cutoff point in drug clinical trials, where it thus represents a make-or-break point in a multi-billion dollar industry.

I think that is a bit recursive. It's not so much a fetishization as it is a misunderstanding that results in it being a valuable target.

It would be a heck of a lot more useful as a target (and granted...I would argue the target should be more like .001) if more research adopted Bayesian statistical techniques, where you can't p-hack as easily.


The standard ritual for measuring significance in research seems to me to be some strange marriage of the ideas of Fisher, Neyman, and Pearson that I'm not sure any of them would have actually agreed with. I'd be interested to hear any historians of statistics or scientific methodology comment more on that angle, or correct my misinterpretation if that's what it is.



Semi-off topic: as someone who does a lot of data analysis with SQL but has never taken a statistics course, can anyone recommend resources for learning how to calculate and apply p-values, R-squared, etc.?


https://www.coursera.org/specializations/statistics

Skip all the programming exercises in R - just watch the videos and solve the multiple-choice problems. Supplement with the decent open-source textbook it links to.

Each "week" is likely only 1-2hrs of work. ~5 weeks per course. Only really need the first 2 courses:

   . Introduction to Probability and Data

   . Inferential Statistics
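
If you just want to see those quantities come out of a computation right away, here's a minimal sketch using scipy on made-up data (my own example, not part of the course). linregress reports the correlation r, so R-squared is just its square:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 2.0 * x + rng.normal(size=100)       # a real linear relationship plus noise

    result = stats.linregress(x, y)
    print("slope:", result.slope)
    print("p-value for slope != 0:", result.pvalue)
    print("R-squared:", result.rvalue ** 2)  # linregress reports r; square it for R^2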


I wrote a couple papers for clinicians introducing p-values and the necessary apparatus to understand them correctly: http://madhadron.com/posts/2016-01-25-p_values_for_clinician...


>"The P-value is the smallest relevant value of α given your data (i.e., the smallest probability of making a Type I error and deciding there is an effect when there isn't one)."

Nope, the p-value calculation assumes there is no effect. How can it be the probability there is an effect?
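
To make that concrete, here's a minimal simulation sketch (all numbers are toy values of my own): the p-value is the probability, computed under a model with no effect at all, of seeing a test statistic at least as extreme as the one observed.

    import numpy as np

    rng = np.random.default_rng(1)
    observed_diff = 0.4      # pretend this is the observed difference in group means
    n_per_group = 50

    null_diffs = []
    for _ in range(50_000):
        a = rng.normal(0.0, 1.0, n_per_group)   # both groups drawn from the same
        b = rng.normal(0.0, 1.0, n_per_group)   # distribution: no effect by construction
        null_diffs.append(abs(a.mean() - b.mean()))

    p_value = np.mean(np.array(null_diffs) >= observed_diff)
    print("simulated two-sided p-value:", p_value)   # ~0.046 for these toy numbers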


Reading comprehension: "deciding that there is an effect when there isn't one."


p-values are very misunderstood and a lot more subtle than most people believe.

If you're interested in p-values, I wrote a post on them here with some counterintuitive examples (one of them shows how a lower p-value can sometimes increase your belief in the null hypothesis).

https://mindbowling.files.wordpress.com/2016/07/pvalues.pdf
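
One standard illustration of that kind of counterintuitive behaviour (my own sketch, not taken from the PDF) is the Jeffreys-Lindley effect: under a simple normal model with an assumed N(0, 1) prior on the effect size, a p-value of ~0.01 from a huge study can support the null more strongly than a p-value of ~0.05 from a small one.

    import math

    # H0: mu = 0  vs  H1: mu ~ N(0, tau^2); the observed mean is N(mu, sigma^2 / n).
    # sigma = tau = 1 are assumptions made purely for illustration.
    def bayes_factor_null(z, n, sigma=1.0, tau=1.0):
        """BF01: evidence for H0 over H1, given a z-statistic from n observations."""
        r = n * tau**2 / sigma**2
        return math.sqrt(1.0 + r) * math.exp(-0.5 * z**2 * r / (1.0 + r))

    print(bayes_factor_null(z=1.96, n=10))         # p ~ 0.05, small study: ~0.6 (leans toward H1)
    print(bayes_factor_null(z=2.58, n=1_000_000))  # p ~ 0.01, huge study:  ~36  (strongly favors H0)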


This is great - you should really put your name and details on there. You deserve the recognition!


Thanks! I'm glad you liked it. I guess I didn't think about putting my name on there, but maybe I will :)


What frightens me more is how rarely I see talk about decision theory and hypothesis testing in the (deep) machine learning community; it is as if people consider the classification output sufficient evidence of recognition and just quote max(p(class)) rather than the significance of the class given its classifier score. Am I missing something?


What do you think calculating the "significance" would add?


Most dCNN classifiers I’ve seen used simply rely on softmax to provide a “probability” among the classes. But it really is a score that has been normalized across the classes. Having 0.7 of one class does not mean the same level of discrimination from the remaining classes as it does for another class with the same score. By only using the maximum scoring class you don’t account for what score value is sufficient to claim a significant discrimination between the maximum score and the alternatives.
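
A toy sketch of what I mean (the logits are made up): two inputs can get essentially the same top softmax "probability" while the gap to the runner-up class is very different.

    import numpy as np

    def softmax(logits):
        e = np.exp(logits - np.max(logits))
        return e / e.sum()

    a = softmax(np.array([2.0, 1.12, -3.0, -3.0]))  # top ~0.70, runner-up ~0.29
    b = softmax(np.array([2.0, 0.05, 0.05, 0.05]))  # top ~0.70, runner-up ~0.10

    for probs in (a, b):
        top, second = np.sort(probs)[-1], np.sort(probs)[-2]
        print(probs.round(2), "top:", round(top, 2), "margin:", round(top - second, 2))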


I see how the probability (softmax output) contains more info than simply the class with highest probability, but not what you want regarding "significant discrimination".

Perhaps you want to weight different types of errors differently? Eg: https://github.com/Hezi-Resheff/paper-log-bilinear-loss



Someone needs to reference the XKCD strip on jelly beans.

https://xkcd.com/882/


Scientists need to start using a "training set" and a "test set".

If you have 2000 samples of data, you don't train your model on all of it and then report the accuracy on that same data as your success rate. You'll end up with conclusions that don't generalize.

Instead train on 1600 and measure your success on the remaining 400.

Similarly, don't look for statistical significance among your 2000 samples and conclude that's the result. Do it across 1600 and then validate it on the 400. If there is a real result there, it'll reproduce. It now makes your process robust to overfitting / param hacking.

You avoid the green jelly bean problem entirely.
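
Here's a rough sketch of that procedure (all parameters are mine, purely for illustration): screen 20 candidate predictors on a 1600-sample split, then only keep the ones that also reach p < 0.05 on the held-out 400. Every predictor below is pure noise, so anything that survives both stages is a false discovery.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n, n_predictors = 2000, 20
    y = rng.normal(size=n)
    X = rng.normal(size=(n, n_predictors))   # none of the predictors is related to y

    train, test = slice(0, 1600), slice(1600, 2000)
    survivors = []
    for j in range(n_predictors):
        _, p_train = stats.pearsonr(X[train, j], y[train])
        if p_train < 0.05:                   # "discovered" on the training split
            _, p_test = stats.pearsonr(X[test, j], y[test])
            if p_test < 0.05:                # confirmed on the held-out split
                survivors.append(j)

    print("noise predictors surviving both splits:", survivors)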


> You avoid the green jelly bean problem entirely.

Not entirely. Quite often, the big fishing expedition research is done on things like epidemiological data sets, where you've got a million monkeys with a million typewriters trying to publish a million papers, all based on one data set.

Under that kind of situation, assuming that everything works out perfectly, after the first pass you'll have some random non-negative number of hypotheses that were collected from fishing that you're going to test again. And those tests will also have a 5% type I error rate, so the fishing expedition will still have a problem with multiple comparisons. Your cumulative risk of a type I error will be mitigated, but not reduced to the traditional 5% (or whatever alpha you were shooting for).

Meanwhile, your risk of a type II error will have been increased considerably: You're replacing one test of a relatively high statistical power with two tests of relatively lower power, and set up a situation where a type II error in either of them yields a type II error in the overall test.

Not saying that there's no value to approaching things that way, but it's not the free lunch you're suggesting it is.
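
A quick simulation of the type II error point (toy numbers of my own): with a real but modest effect, demanding p < 0.05 on both the 1600 and the 400 split finds it far less often than a single test on all 2000 samples.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    effect, n, n_sims = 0.06, 2000, 2000
    single = two_stage = 0

    for _ in range(n_sims):
        x = rng.normal(size=n)
        y = effect * x + rng.normal(size=n)
        _, p_all = stats.pearsonr(x, y)
        _, p_train = stats.pearsonr(x[:1600], y[:1600])
        _, p_test = stats.pearsonr(x[1600:], y[1600:])
        single += p_all < 0.05
        two_stage += (p_train < 0.05) and (p_test < 0.05)

    print("power, single test on all 2000:", single / n_sims)     # roughly 0.75 here
    print("power, both splits must pass:  ", two_stage / n_sims)  # roughly 0.15 here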


Science often works with data set sizes that make the standard ML approach unworkable (e.g. 12 measurements which took months and thousands of dollars to make).


The proposal isn't to use ML. It's simply an analogy.

If you're performing statistical inference on sample sizes of 12 you can't then be surprised by a lack of predictability.

And I'm pretty confident the majority of experiments are using sample sizes larger than 12.


>And I'm pretty confident the majority of experiments are using sample sizes larger than 12.

It definitely varies by field; testing 3 doses and a negative control with 3x replication would be a pretty good experiment for testing a drug in mice. It's plenty of data for identifying a drug that actually works, as long as your statistical tools aren't designed with the expectation of thousands of samples.


I don't think iskander thought you were advocating using ML; they were just saying that the approaches _usually_ used in ML would not work on some datasets that are way too small.

A 12-point dataset is not desirable _per se_, but sometimes it's all one has - as is the case, for example, in health-related research.


In which case, you can't really afford train and test splits, but you can look at approaches like leave-one-out to estimate what your sampling error looks like.


And, more generally, bootstrapping, which is a class of methods for re-sampling one's data in order to estimate uncertainty.

https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
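
A minimal sketch of the idea (the 12 measurements are made up): resample the data with replacement many times and look at the spread of the statistic you care about.

    import numpy as np

    rng = np.random.default_rng(3)
    data = np.array([4.1, 3.8, 5.0, 4.4, 4.9, 3.7, 4.2, 4.6, 5.1, 3.9, 4.3, 4.7])

    boot_means = np.array([
        rng.choice(data, size=len(data), replace=True).mean()
        for _ in range(10_000)
    ])

    low, high = np.percentile(boot_means, [2.5, 97.5])
    print(f"sample mean: {data.mean():.2f}")
    print(f"bootstrap 95% interval for the mean: [{low:.2f}, {high:.2f}]")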


I think leave-one-out is actually a special case of k-fold cross-validation where k is equal to the number of data points. But yeah, you can use bootstrapping, cross-validation, the jackknife. There's a large toolset there, and I trust an analyst who explains what motivated them to choose a certain tool more than one who just uses a tool because it's what they always use or it's expected of them.


This is the standard recommendation, but does it really help?

You do that test on the second subset, and then you discard every theory that doesn't pass both the 1600 set and the 400 set.

So you end up with the predictions that pass all the data in your original 2000 samples.

Is it really any better at generalizing to new data? If so, can you just evaluate your theories on the unpartitioned data by doing randomized subset testing after-the-fact?


It depends on how often you go to the well. This is why things like Kaggle have limitations on how often you can submit entries for evaluation. This is why some studies go with a three-way split of train-calibration-test. And you can look at things like cross-validation, bootstrapping, etc. Nothing is perfect.

As soon as you lock yourself into a one-size-fits-all way of validating your out-of-sample validity, you'll find lots of people who are optimizing for whatever test you come up with, not out-of-sample validity. It pays to pay attention to how well the researcher understands and deals with their particular set of problems, not how well they meet some arbitrary criteria. (Which is exactly how we ended up having all these problems with p-values! P-values aren't inherently bad, they're just applied indiscriminately because It's The Way We Do Things. If we start indiscriminately applying test/train splits we'll end up at a different place with its own set of problems. The chief problem is the lack of discrimination, not the specific tools under consideration.)


It would probably be helpful to see the math to confirm, but I believe that if it's a spurious relationship, the probability of it passing both the 1600 and the 400 is less than the probability of it passing the 2000.
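
As a rough back-of-the-envelope check (assuming α = 0.05 at each stage): the two splits contain disjoint samples, so for a purely spurious relationship the two tests are approximately independent, and the chance of passing both is about 0.05 × 0.05 = 0.0025, versus 0.05 for a single test on the full 2000.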


It may (or may not) be less, but if scientists just discard the tests which don't pass both, then passing both becomes the new prior, and we go back to the same situation.





