Hacker News
Redefine statistical significance (nature.com)
303 points by arstin 43 days ago | 114 comments

>For a wide range of common statistical tests, transitioning from a P value threshold of α = 0.05 to α = 0.005 while maintaining 80% power would require an increase in sample sizes of about 70%.

This proposal is a great pragmatic step forward. Like they say in the paper, it doesn't solve all problems, but it would be an improvement with reasonable cost and tremendous benefits.

>Such an increase means that fewer studies can be conducted using current experimental designs and budgets. But Fig. 2 shows the benefit: false positive rates would typically fall by factors greater than two. Hence, considerable resources would be saved by not performing future studies based on false premises.

In some fields like psychology, power is more likely to already be 10% or 20% for the majority of studies, and in fact P-hacking and low standards for evidence would be far less harmful if power were higher, because low power leads to inflated effect size estimates. Additionally, power calculations are always just a guess and easy to fudge, so it's pretty much a given that current statistical power would not be maintained with more stringent critical values. See http://andrewgelman.com/2014/11/17/power-06-looks-like-get-u...
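The point about inflated effect sizes can be checked with a small simulation. This is only a sketch under assumptions of my own: each study's estimate is drawn from a normal distribution around a hypothetical true effect of 0.1 with standard error 0.1 (roughly 17% power at α = 0.05), and only estimates that clear the two-sided threshold are kept. The function name and all numbers are illustrative, not taken from the paper or the linked post.

```python
import random
from statistics import NormalDist, mean

def simulate_significant_estimates(true_effect, standard_error,
                                   alpha=0.05, num_studies=20_000, seed=0):
    """Return the effect estimates from simulated studies that reach
    two-sided significance at the given alpha."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)
    significant = []
    for _ in range(num_studies):
        # Assumption: the estimate is normally distributed around the truth.
        estimate = rng.gauss(true_effect, standard_error)
        if abs(estimate) / standard_error > critical_z:
            significant.append(estimate)
    return significant

true_effect = 0.1  # hypothetical true effect
significant = simulate_significant_estimates(true_effect, standard_error=0.1)

observed_power = len(significant) / 20_000
exaggeration = mean([abs(e) for e in significant]) / true_effect
print(f"share of studies reaching significance: {observed_power:.2f}")
print(f"average |significant estimate| / true effect: {exaggeration:.1f}")
```

Under these assumptions every estimate that passes the filter has to be at least about 1.96 standard errors from zero, so the "published" estimates overstate the true effect by construction.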

So this proposal is really the opposite of pragmatic. Pragmatic would be requiring effect size estimates and confidence intervals in all published papers. It is surprising how many papers will talk about highly significant effects without actually discussing how large the estimated effect is thought to be, which gives authors a lot of leeway when exaggerating the importance of their findings.

So ultimately the issue is that push button statistics don't work?

Ultimately the issue is that we somehow got fixated on the p-value, which (roughly and imprecisely speaking) quantifies the probability that there is any effect, even a small one, rather than using effect size estimates which estimate both the magnitude of the observed effect and the uncertainty in that estimate.

Using p-values as our primary metric means an overemphasis on finding small effects (which are usually not clinically relevant anyway) and unduly low focus on things with big effects.

If an effect is real, but very small, that too may well cause replicability problems because it suggests the effect may not be very robust to small changes in experimental conditions, whereas a big effect would be more likely to be robust.

If you think about the really important scientific findings -- the ones that made a big impact and are indisputably true -- statistics usually aren't necessary to prove them, because the effect size is so large it is simply obvious. I'm not against using statistics anyway, of course, but the point is that we should be looking mainly for effects with big effect sizes if we are after important findings, IMO. It is only a major bonus that big effect sizes are most likely to be replicable.

The cost of increasing sample size is significant; this is a trade-off that allows smaller projects to still conduct valuable research.

Studies with low significance are no better than anecdotes.

Well, Bayes theorem should let us, in theory, combine several small studies into the equivalent of one larger one.

In practice it's harder than simple number crunching due to methodological differences between studies, but it seems a shame to call a study essentially worthless.

I like to be a little optimistic and think, "Data is data and data is good. The challenge is in interpreting it."

> "Data is data and data is good"

This is a reasonable intuition, but is hard to square with the increased risk of publication bias for small studies (i.e. we should question whether the typical small study published is really "data" in the sense we'd like).

Since small studies are more sensitive to noise and "fishing" for an effect, and since non-results are rarely published, what happens with meta-analysis of many published small studies is that you end up primarily looking at outliers rather than typical results. Since most research questions have a directionality to what counts as "interesting", the outliers also tend to be clustered on one side rather than evenly split.

The two papers I recommend anyone read if they care about these problems are:

Why Most Published Research Findings are False


The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time


There is plenty of evidence that people who try to replicate results often keep trying until they get the same result as the original paper. For example, a scientist made a mistake when counting the number of chromosomes and reported that humans had 48, a number that was replicated for over 30 years. In this case it would obviously have been better if the original paper had never been published.



Alt-Text: "Correlation doesn't imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing 'look over there'."

I think you could replace correlation with "small studies" or "anecdotes". These are all things that suggest there may be some effect, and there may be merit in further study.

I agree that anecdotes are useful; I disagree with publishing them in scientific journals and telling the public that we found significant results.

This is not true at all. Small studies are simply less accurate. Would you prefer to conduct further research on a cure for cancer that's been shown to lead to recovery in 0-50% of cases (the confidence interval includes 0%, so not significant) or in .5-1% of cases (significant)?

As a matter of fact, I can do even better than 0-50% of cases! Consuming widgets has been shown to lead to recovery in 0-100% of cases! It isn't as simple as you suggest.

Yep, and that makes widgets a more interesting target for follow-up studies than a treatment that has been shown to result in no clinically significant improvement. It really is that simple.

Surely we could create a system to fund extensions of studies with promising initial results? I just think that it is dangerous to call 95% significant.

How does it make sense to suppose it could lead to recovery in 50% of cases, if a drug that does nothing has a 5% chance of producing the same result in your study?

Because it has a 95% chance of producing a different (better) result. Statistics is all about the quantification of uncertainty.

That's true 95% of the time.

Do you believe that 95% of studies proving that humans possess paranormal abilities are true?
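A rough way to put numbers on this exchange is the share of "significant" results that reflect a real effect, which depends on how plausible the hypotheses were to begin with. The sketch below assumes 80% power and two made-up prior rates (1 in 2 hypotheses true versus 1 in 1000, the latter standing in for something like paranormal claims); the function name and the priors are illustrative assumptions.

```python
def positive_predictive_value(prior: float, alpha: float, power: float) -> float:
    """Fraction of significant results that correspond to a real effect.

    prior: assumed share of tested hypotheses that are actually true
    alpha: significance threshold (false positive rate under the null)
    power: probability of detecting a real effect
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# Hypothetical priors: a plausible hypothesis vs. a very implausible one.
for prior in (0.5, 0.001):
    for alpha in (0.05, 0.005):
        ppv = positive_predictive_value(prior, alpha, power=0.8)
        print(f"prior={prior}, alpha={alpha}: "
              f"share of significant results that are real = {ppv:.3f}")
```

With the implausible prior, most significant results are false positives at either threshold, which is why "significant at 95%" and "95% likely to be true" are different statements.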

Almost all problems have to do with data and the selection of data. Changing the p-value threshold wouldn't help.

I upvoted you in principle, but working with a tighter threshold would also make choosing self-serving data samples more obvious, if not more difficult.

The real solution is social scientists using statistics properly. Data mining 0.005 p-values isn't practically speaking much harder than 0.05. And some of these clowns are actually data mining on purpose; I've listened to them talking about it.

This is fine, but without other simultaneous changes it will do harm to young scientists. We need credit for publishing null results, or an end to judging researchers by publication count. The change would lead to larger, better-powered studies (good), but this tends to lead to acquiring multiple measures which can be inappropriately data-mined, and to large grants for established investigators but fewer grants for new investigators.

Definitely. The main problem is that in the current system no one is being rewarded for good science, but for showing something interesting, bolstered by a declaration of (statistical) significance. The incentives are not aligned with societal objectives.

Good science requires a tension between hypothesis generation and skepticism. Perhaps if we rewarded the _debunking_ of findings as much as we do the discovery of findings, things would change.

Why doesn't this happen already?

The funding bodies etc., who want "quantitative" measures of research, look at publications. Why would we expect debunking papers to be published if they are debunking something interesting?

Once you have debunked something interesting I suppose you don't have trouble getting it published. But you have a hard time writing a grant application that reads something like "I want to replicate study X, no new results are expected."

I think this hits the nail on the head.

That said, it also seems like low-hanging fruit. At least in some fields, replication should be a lot cheaper than doing things in the first place -- because a lot of the cost is in pissing about trying to find something that even seems to work.

Having funding bodies explicitly support replication studies, even if each gets only 20% of the amount the original studies get, should be a winner at reasonably low cost.

No, because they want to present the conclusions of each project as a "fact" that we now know for certain, as in a constant march of progress. They feel it reflects poorly on them if they published/funded something incorrect, so they come up with any excuse to not publish any critique or do a retraction. E.g., Andrew Gelman has some pretty good stories about this:

"In the review process they did not disagree with my points at all, but they refused to publish the correction on the grounds that they only publish the very best submissions to ASR." http://andrewgelman.com/2016/02/22/its-too-hard-to-publish-c...

It can hardly hurt, but it is still a stopgap measure. It won't solve publication bias, and people will still change the hypothesis or the test after measurements are done.

I think the situation would improve with better teaching of philosophy of science and statistics (this would educate better reviewers too).

Agree. It doesn't stop p-hacking, it just makes it harder. Definitely treating the symptom instead of the disease. We ultimately need institutional and cultural change, but it's not obvious how to do that in the short term, so making it harder to claim significance might be a step in the right direction.

On the other hand, you might expect that new discoveries by nature have less data since data is likely more expensive for brand new research, and by extension a lower likelihood of meeting these sorts of stringent statistical requirements. Decreasing the p-value threshold may be counter-productive if we dismiss legitimate new discoveries due to essentially economic constraints with data gathering, which would have the impact of making it less likely to get funding to pursue the problem in more depth, thereby slowing the advance of discoveries.

> It doesn't stop p-hacking, it just makes it harder.

I could see the reverse happening, where higher p-value standards lead to normalization of deviance in the form of worse p-hacking.

I agree with that. My prediction is that legitimate, honest research with p~0.05 will become unpublishable, while p-hacked bullshit will prevail in glory.

If you do an honest preliminary study and get a p~0.05 result, then repeating the study with 70% larger sample size should get you p~0.005; but if you've p-hacked, it won't.

But sometimes getting a 70% larger sample will cost so much money that you'd need another grant, one you'd only get after publishing your p=0.05 result, which under the new rules you wouldn't be able to publish...

Statistics is just another method and almost every method can be hacked or abused. Science is not about putting checkmarks in tables but about reading and understanding ideas and reproducing the results. Tweaking some numeric values is not going to help the review process which is fundamentally broken these days.

This is addressed in the paper. The additional costs of larger sample sizes outweigh the costs of studies based on false positives.

It can hurt, in that it can slow the spread of information. If each experiment needs a 70% larger sample to hit p=0.005 instead of p=0.05, then on a fixed budget you run roughly 40% fewer different types of experiments, and you explore in fewer directions.

This is a classic tradeoff between exploration and exploitation in active learning.

If your view of the world is that there are only a very few hypotheses worth exploring, and you have a good lay of the scientific land, then requiring a higher bar of proof is probably good.

If it's a new field that's extremely complex and where very little is known of the governing principles, then requiring very high stats could severely slow progress and waste lots of research dollars.

I completely agree that rather than setting arbitrary barriers for significance, it would seem much better to let people actually understand what was found, at whatever significance it was. Even setting up the null model to get a p-value requires tons of assumptions. The better test is reproducibility and predictive models that can be validated or invalidated. That's where the science is, and not in the p.

> "It can hurt, in that it can slow the spread of information."

I am not at all in favor of this proposal, but one thing it may do is stem the tidal wave of misinformation.

Yours is a theoretical concern.

The very practical concern is that entire areas of research have been based on studies replicated and backed up entirely through p-hacking and selectively publishing only papers with positive results. This is a proven issue today. See https://en.wikipedia.org/wiki/Replication_crisis for more.

It may be that there is a pendulum that needs to swing a few times to get to a good tradeoff. But it is clear, now, which direction it needs to swing.

I disagree 100%, having read that Wikipedia page and its sources.

It's something that affects a few fields, not all science. And the problem has been completely 100% overblown.

If the problem is that things aren't replicating, changing the p-value cutoff for significance isn't going to fix everything. It can just as easily be a bad null model that's the problem, in which case you can't trust any p-value. The MRI scan problem was closely related to that.

It's a field-specific and null-model-specific thing. Broadly changing the p-value cutoff for everybody isn't going to fix this issue.

>"And the problem has been completely 100% overblown"

Just the fact that so few replications have been published indicates huge cultural problems. When I did biomed, in my tiny area of expertise there had been 1-2 thousand papers published since the 1980s. Out of these maybe 2-3 were close to direct replications. None of those showed the results were reproducible, but no one cared...

Usually there were "minor differences" in the methods so it resulted in stuff like: "Protein P has effect E by acting through receptor R in cell line L from animal A of sex S and age Y when in media M".

However if you changed L, A, S, Y, or M apparently totally different things were going on (there were then supposedly dozens of receptors for each ligand, each receptor having dozens of ligands in different circumstances, etc).

In the end I found that E was nearly perfectly correlated with the molecular weight of P (using data from one of the most cited papers on the topic, in which they specifically claimed there was no correlation with any physical properties of the ligands).

So the effect has nothing to do with specific ligand-receptor interactions at all, but no one cared. Situations like this (with few published direct replications, the ones that are published are contradictory, the results are all being misinterpreted anyway, and everyone just continues on their way when problems are pointed out) are totally standard for biomed. The replication aspect of the issue is really only the tip of the iceberg of problems.

> It won't solve publication bias, and people will still change the hypothesis or the test after measurements are done.

As a Psychology student, I can say there is a well-known initiative for this: https://cos.io/prereg/

(Though I can't confirm or deny its widespread usage.)

The publication bias is harder, and pre-registration won't solve this. But I think this is a separate issue, and it's important to address each issue in its own right.

I've seen the proposal from TFA before and with my very limited knowledge, I'm still fairly certain it will never come to pass in Psychology, as nearly half of all modern studies have reproducibility issues (!). It would be beneficial to our field, in the way that a band-aid is beneficial to a gaping wound, but it would require a lot more rigor than has evidently been displayed so far (and more rigor is more work, and time is limited).

So... Don't hold your breath.

(Sorry if my comment sounds pessimistic, I don't know much and I'm open to being corrected. I still have enough critical thought to be skeptical of some researchers' dedication to intellectual rigor.)

>I think the situation would improve with better teaching of philosophy of science and statistics (this would educate better reviewers too).

This is necessary, but not sufficient. What's needed is a way to know for sure that the hypothesis was not changed after data collection. I think predeclaring the hypothesis is the way to go.

Yeah, but at the end you can still fabricate the data, remove "outliers",... Plus it's almost impossible to imagine a world where, before any experiment in any field, you predeclare it.

Not that education can fix all these (you can't prevent evil), but if reviewers and journals and conferences started to accept negative results more readily, the incentive to lie would quickly decrease. And people would probably start to "disprove" interesting theories, instead of trying to "prove" niche results...

>Yeah, but at the end you can still fabricate the data, remove "outliers",...

Fabricating data is essentially fraud. And while it does happen, most of the problems with reproducibility are not problems of fraud.

It won't protect against outliers, but removing outliers will not solve most problems. It'll happen, but again, I don't think the majority of irreproducible studies are due to misuse of outliers.

>Plus it's almost impossible to imagine a world where, before any experiment in any field, you predeclare it.

Not at all. I'm not saying you predeclare every experiment - just every experiment you try to publish.

The way it works is:

1. You make observations (i.e. collect data - no predeclaring anything). If you see interesting patterns, you'll form a hypothesis.

2. This is the stage where you predeclare your hypothesis, and the criterion of falsification.

3. You now collect new data and test it against your hypothesis.

The hard part is ensuring people won't use some of the old data and claim they collected it after their declaration. It's a hard problem, but not an impossible one.

People are in the habit these days of collecting a lot of data, seeing patterns, and publishing them. That's really not how a lot of early science was done. Once you see the patterns, you need to conduct more experiments to falsify them.

Genuine curiosity here. What's wrong with making an experiment, and when the results clearly contradict your initial assumptions (so much that the opposite is confirmed) then publishing the found results?

Because a very common type of hypothesis is along the lines of "gene X is important in disease Y" or "priming people with words will affect psychological outcome Z".

The opposite hypothesis is the null hypothesis which is "gene X is NOT important in disease Y" or "priming DOESN'T affect outcome Z".

Since we assume that most interventions will not affect most outcomes, these are much less surprising and interesting results. They are seen as "water is wet" type of findings and are thus hard to publish because no one is interested.

Now if your hypothesis is something like "X will cause Y to go up" and you actually find it causes Y to go down, that IS publishable. It is only when X has no effect on Y that you will have problems.

If you mean in the context of what I was saying: Nothing wrong. But if you found some interesting correlations, you should not publish it - you should predeclare the hypothesis and what you'll look for, collect more data, and then publish if the new data matches the hypothesis.

This is assuming predeclaring ever becomes the norm.

Not a huge fan of this idea. For example, people who analyze twitter data can get very small p-values because they analyze millions of tweets even though the effects they find are very small. See https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1336700
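To illustrate the sample-size point with made-up numbers (not taken from the linked paper): for a two-sample z-test, the test statistic grows with the square root of the sample size, so even a trivially small standardized difference becomes "significant" once n is large enough. The effect size of 0.01 standard deviations and the function name below are assumptions for the sake of the sketch.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_p_value(effect_size_d: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test when the observed
    standardized mean difference equals effect_size_d."""
    z = effect_size_d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

tiny_effect = 0.01  # hypothetical: one hundredth of a standard deviation
for n in (1_000, 100_000, 1_000_000):
    p = two_sample_p_value(tiny_effect, n)
    print(f"n={n:>9,} per group: p={p:.2e}")
```

The same tiny effect is nowhere near significant with a thousand observations per group and clears even the 0.005 threshold easily with a million, which is why the threshold alone says little about whether the effect matters.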

I'd rather hear about small things that are true than large things that are false.

The thing is that these small things might just be noise from some confounding factors.


For example, here the sample size is huge, and the USA population shows a significantly increased risk while the EU population does not. Mixing the two together would result in a smaller but still significant increased risk.

Given the size, it's quite clear that USA population has many other confounding factors that cannot be eliminated by mathematics alone (there is no control).

Statistical significance doesn't mean an effect is "true", or "real". So getting rid of it shouldn't cause you any issues.

The problem here is that you most likely hear about false positives.

I don't see the problem as long as you clearly separate significance and effect-size.

This bothers me a lot in media reporting. The headline will be something like "X is good/bad for you" with a tiny effect size, but the way it's reported makes you assume it's a large effect size. Usually the effect size won't be discussed in any depth or at all; they just want to sum it up in black and white. If the effect size is tiny it's probably just noise that would disappear in a higher quality future study.

This is why I think better multiple-test-correction methods are more important and would do more to produce a desirable outcome than just lowering alpha.

Multiple-test correction usually implies lowering alpha, doesn't it?

For any given test yes, but it usually tries to maintain your overall alpha at the same level.
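As a concrete sketch of that distinction, here is the Bonferroni correction, which is just one simple correction method and my choice for illustration rather than anything proposed in the paper. The per-test threshold shrinks to alpha divided by the number of tests, while the chance of at least one false positive across the whole family stays at or below the original alpha. The function names and the figure of 20 tests are assumptions, and the family-wise formula assumes independent tests with all nulls true.

```python
def bonferroni_threshold(family_alpha: float, num_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate <= family_alpha."""
    return family_alpha / num_tests

def familywise_error_rate(per_test_alpha: float, num_tests: int) -> float:
    """P(at least one false positive) across independent tests of true nulls."""
    return 1 - (1 - per_test_alpha) ** num_tests

num_tests = 20  # hypothetical number of comparisons in one study
uncorrected = familywise_error_rate(0.05, num_tests)
per_test = bonferroni_threshold(0.05, num_tests)
corrected = familywise_error_rate(per_test, num_tests)

print(f"per-test threshold after correction: {per_test:.4f}")
print(f"family-wise error rate, uncorrected: {uncorrected:.3f}")
print(f"family-wise error rate, corrected:   {corrected:.3f}")
```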

Changing the p-value threshold from 0.05 to 0.005 won't stop p-hacking. And this might also lead to more disgruntled graduate students, as they will then have to increase sample sizes to satisfy the new test, which inevitably lengthens the already painfully long time span for projects to get published.

To be fair, this is a pragmatic, not a technical, solution. Similarly, we limit the speeds we allow in residential areas not because it prevents wreckless driving, but because it decreases the actual risk of it.

Similarly, the technical solution involves technology that does not require drivers and has no risk of human error anymore. The pragmatic solution is to just limit the acceptable speeds.

A bit of humorous pedantry: We seek to prevent reckless driving. "Wreckless" driving is what we're trying to promote.

Thanks for the pedantry. It is amusing to me, because the "w" looks way more correct to me. Can't say why, though.

Though reck is a word, it isn't commonly used, whereas wreck is a pretty elementary word. Reck and wreck are homophones. That's probably your answer.

Right. But to me the word "wreckless" even looks like what I want in this sentence. I think I take the meaning to be "as if wrecks weren't a thing."

As opposed to how the pedantry points out, that if "wreckless" were a word, it would actually be more of the opposite of "reckless."

I've not spent much time in academia, but it was my impression that p-hacking is driven primarily by ignorance rather than willful deceit. If that is indeed the case, it would indeed limit p-hacking as there's usually a finite number of variables being looked at.

Edit: or rather, it would limit false positives that show up as a result of accidental p-value hacking, if not the process itself.

Science would benefit from a little less noise.

Stop worshipping p-values set to an arbitrary threshold, whether it be 0.05 or 0.005, and start actually critically engaging with the statistics and results themselves.

It's time to ditch significance levels altogether and use Bayesian inference or analysis.

I'd rather say it's time to recall what significance levels were actually meant to be and in which context they are useful, and to ditch the contemporary aberration thereof.

Any game can be gamed. Bayesian statistics just isn't there yet.

I'm concerned that prior-hacking will become the new p-hacking.

So you require people report Bayes factors, not posteriors/priors. Those are invariant to the prior.

When can we have scientific papers formatted for the web? Reading PDFs with many tiny columns spread across each page puts me off reading so much.

Most journals have an HTML and a PDF version (as does Nature): https://www.nature.com/articles/s41562-017-0189-z

I prefer the PDF version for printouts.

Thank you for this. I had no idea. Now I can actually read the paper!

This is an interesting point considering the World Wide Web was born from the need to share scientific info.

Can someone explain why this three-page article has 72 "authors"? That works out to about as much writing per author as this comment.

Given the kind of paper this is, I assume the names should be understood as an endorsement. Sorta like signatures on a petition.

Easy, in academia you can be the (co)author of a paper you've never even read.

No one in biology would be able to publish anything.

There are so many problems with this:

1) The p-value filter leads to publication bias.

-You should publish your results anyway, or the study wasn't designed/performed correctly. The raw data and description of methods should be valuable.

2) The null hypothesis is (almost) always false anyway.

-Everything in bio/psych/etc has a real (not spurious) non-zero correlation with everything else, so the significance level just determines how much data needs to be collected to reject it.

3) Rejection or not of the null hypothesis does not indicate whether the theory/explanation of interest is correct, so is inappropriate for deciding whether a result is interesting to begin with.

-Usually the null hypothesis is very precise and the "alternative statistical hypothesis" that maps to the research hypothesis is very vague, so many alternative research hypotheses may explain the results.

I would add that 4) Studies with a large p-value but that are not contradicted by others are much more valuable than studies that have a small p-value but all contradict each other.

Yes, these p-values are irrelevant in every way. What matters is that people figured out a way to get reproducible results. If I can be confident that an "effect" is smaller than x, that is good information. A bunch of studies with slightly different methods with small p-values in opposite directions don't give any confidence at all.

Imagine psychology... The end of a science.

Funny, I was thinking the opposite.

Imagine psychology... done properly. The beginning of a science.

(I realize that "beginning" is too harsh, but psychology does have very serious problems with replicability. At the moment, it deserves its tarnished reputation.)

it's too far away

Or it never was one

The concept of statistical significance is nonsense. In Bayesian statistics there is only evidence. A p value of 0.05 is roughly equivalent to a factor 20 of evidence. That means you multiply the odds you believe in a hypothesis by 20 (or add 13 decibels.) Similarly a p value of 0.005 is roughly equivalent to 200 units of evidence (23 decibels.)

But whether some amount of evidence is "significant" or not is entirely dependent on your prior. If you believe something has about a 50:50 chance of being true to start with, then a factor 20 of evidence is quite enough. Now you believe it 20:1 likely to be true.

But for something like xkcd's "green jelly beans cause cancer", your prior should be something like 1 to 100,000 or even smaller. After all, there are a lot of possible foods and a lot of possible diseases. Unless you believe a significant number of them are dangerous, your prior for any specific food causing any specific disease must be pretty low. And then even a factor 200 of evidence is nowhere near enough to convince me that green jelly beans cause cancer.
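The arithmetic in this comment can be written out as a short sketch. It takes the commenter's evidence factors (20 and 200) and the 1-to-100,000 prior as given rather than deriving them, and simply applies the odds form of Bayes' rule: posterior odds equal prior odds times the evidence factor. The function names are mine.

```python
from math import log10

def decibels(evidence_factor: float) -> float:
    """Express a multiplicative evidence factor on a decibel scale."""
    return 10 * log10(evidence_factor)

def posterior_probability(prior_odds: float, evidence_factor: float) -> float:
    """Update prior odds by an evidence factor and convert to a probability."""
    posterior_odds = prior_odds * evidence_factor
    return posterior_odds / (1 + posterior_odds)

print(f"factor 20  = {decibels(20):.0f} dB")
print(f"factor 200 = {decibels(200):.0f} dB")

# 50:50 prior with a factor of 20 gives 20:1 odds.
even_prior = posterior_probability(1.0, 20)
# 1 to 100,000 prior (the jelly bean case) with a factor of 200.
jelly_bean = posterior_probability(1 / 100_000, 200)

print(f"50:50 prior, factor 20:      {even_prior:.3f}")
print(f"1:100,000 prior, factor 200: {jelly_bean:.4f}")
```

Even with the stronger factor, the jelly bean hypothesis ends up at odds of about 1 to 500, which is the commenter's point about the prior dominating.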

If it is nonsense you shouldn't be able to translate it coherently into various levels of evidence.

P values aren't nonsense and correlate with bayesian evidence. I think that interpreting levels of evidence as "significant" or not is nonsense.

It used to be possible to have a successful academic career without publishing much - for example, one of my philosophy profs in college (at a top 10 school) had never published anything after his dissertation (he got his PhD in the early 60s).

Of course, this system only worked because academia was a bastion of the male WASP elites that didn't have much pretense of serving the broader public. But at least you didn't have the torrent of mediocre papers that you see today.

> academia was a bastion of the male WASP elites that didn't have much pretense of serving the broader public

Have things really changed? I suspect there are fewer males, but any job that demands 20 years of full-time concerted effort is likely to be dominated by men. Similarly, the western world is overwhelmingly caucasian, so again... the best predictor (now as then) is that white male professors will be represented disproportionately.

> at least you didn't have the torrent of mediocre papers that you see today.

That certainly is true. Stats for the humanities and social sciences are that 80% of the papers have zero citations. i.e. they have no contribution to the greater body of human work.

In Physics (my background), most papers have 2-3 citations, and only a small percentage have 1 or fewer.

I would say that if a discipline is dominated by uncited papers, then that discipline is probably a waste of time. And the professors who work in it are a net drain on society.

As a note, WASP refers to old money families with ties going back to the colonial era, not just middle-class/wealthy white dudes in America. Also, at least in STEM departments, you will see plenty of non-white names.

> In Physics (my background), most papers have 2-3 citations, and only a small percentage have 1 or fewer

Does that account for self-citations?

> WASP refers to old money families with ties going back to the colonial era

No, it refers to White Anglo-Saxon Protestants, an ethno-religious group that cuts across socio-economic class divides and includes plenty of people that are neither old money nor descended from families that have been in the US since the founding (and excludes some old-money, from-the-founding families.)

It does exclude many large subpopulations from the many immigration waves coming after the colonial era. All the Irish, Italian, Polish, and Jewish people; the vast majority of 19th and early 20th century (very large!) immigrant groups and their descendants aren't WASPs.

An equally plausible interpretation is that universities have been transformed from teaching institutions into paper factories.

So... what the academics did wasn't very helpful, but at least they didn't do much of it?

Those ones taught. Whether it was helpful or not depends on how good the teaching was and how useful the knowledge is. Not everybody in a teaching institution should be required to push humanity's knowledge forward.

But then you get the problem of selecting those people without easily measured objective indicators. That's why it worked reasonably well when those were slightly low paying jobs restricted to a caste.

> For a wide range of common statistical tests, transitioning from a P value threshold of α = 0.05 to α = 0.005 while maintaining 80% power would require an increase in sample sizes of about 70%.

This seems unintuitive and the claim is unreferenced. Can anyone explain why this is the case (if true)?
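A rough way to check the figure, assuming a two-sided z-test and the usual normal approximation (the paper may have used other tests; this is only a sketch): for a fixed effect size, the required sample size is proportional to (z_{1-α/2} + z_{power})², so the ratio between the two thresholds does not depend on the effect size at all. The function name `relative_sample_size` is made up for this illustration.

```python
from statistics import NormalDist


def relative_sample_size(alpha_old, alpha_new, power=0.80):
    """Ratio n_new / n_old for a two-sided z-test at fixed effect size.

    Uses the normal-approximation rule n ∝ (z_{1-alpha/2} + z_{power})^2,
    which is an assumption of this sketch, not something stated in the paper.
    """
    z = NormalDist().inv_cdf          # standard normal quantile function
    z_power = z(power)                # quantile matching the target power
    n_old = (z(1 - alpha_old / 2) + z_power) ** 2
    n_new = (z(1 - alpha_new / 2) + z_power) ** 2
    return n_new / n_old


ratio = relative_sample_size(0.05, 0.005)
print(f"sample size multiplier: {ratio:.2f}")
```

Under those assumptions the multiplier comes out at roughly 1.7, which is where the "about 70%" comes from. The intuition is that the critical value only moves from about 1.96 to about 2.81 standard errors, and sample size scales with the square of the total distance you need to cover.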

I've had some luck showing John Kruschke's Bayesian estimation supersedes the t-test (BEST) and this simple demonstration http://www.sumsar.net/best_online/ to people.

> The choice of any particular threshold is arbitrary [...]

Sounds scientific, doesn't it?

> [...] we judge to be reasonable.

And tomorrow someone else judges it differently?

Maybe they should not try to redefine significance but simply introduce something called 'well-reproducible' or so.

Curious if anyone's done any work to determine if changing the P-value threshold for e.g. Psychology studies (as they call out Psychology in particular) measurably affected replicability with p > 0.005?

There's overlap in authorship with this paper: http://www.sciencemag.org/content/349/6251/aac4716

Just curious: would this have an effect on testing the efficacy of new drugs? I'd hate to see a false negative result for a drug that could actually help people...
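To put a number on the false-negative worry, here is a sketch of what happens if the sample size is NOT increased. It assumes a two-sided z-test and a study that was sized for exactly 80% power at α = 0.05; real drug trials use their own designs and thresholds, so treat this purely as an illustration. `power_two_sided_z` is a hypothetical helper name.

```python
from statistics import NormalDist

_std = NormalDist()  # standard normal distribution


def power_two_sided_z(delta, alpha):
    """Power of a two-sided z-test.

    delta is the true effect measured in standard errors (the
    noncentrality parameter); alpha is the significance threshold.
    """
    z_crit = _std.inv_cdf(1 - alpha / 2)
    # Probability of landing beyond either critical value.
    return _std.cdf(delta - z_crit) + _std.cdf(-delta - z_crit)


# Effect size (in standard errors) that gives about 80% power at alpha = 0.05.
delta = _std.inv_cdf(0.975) + _std.inv_cdf(0.80)

power_05 = power_two_sided_z(delta, 0.05)
power_005 = power_two_sided_z(delta, 0.005)
print(f"power at 0.05:  {power_05:.2f}")
print(f"power at 0.005: {power_005:.2f}")
```

Under these assumptions the same study drops from about 80% power to roughly a coin flip at the stricter threshold, so the concern is real unless sample sizes go up.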

Animal experiments will get A LOT more expensive. Will there be a concomitant increase in agency funding to offset the increased costs?

They do briefly mention "the relative cost of type I versus type II errors". Both errors (Type I - false positive, Type II - false negative) have some cost associated.

Money saved by using a small sample size is wasted trying to replicate a false positive result and by groups around the world that rely on that false result.

Requiring larger sample sizes would mean fewer experiments are carried out but we will have more confidence in the positive results produced. The outcome is fewer experiments wasted on following up on false positives. None of this requires a change in funding.
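A back-of-the-envelope version of that argument: among results that reach significance, the share that are false positives depends on the threshold, the power, and how likely the tested hypotheses were to be true in the first place. The prior odds of 1:10 used below are an assumption picked for illustration, as is the helper name `false_positive_rate`.

```python
def false_positive_rate(alpha, power, prior_odds_h1):
    """Share of significant results that are false positives.

    prior_odds_h1 is P(H1) / P(H0): the odds, before seeing data,
    that a tested effect is real. Weights are relative to P(H0) = 1.
    """
    false_positives = alpha * 1.0           # true nulls that reach significance
    true_positives = power * prior_odds_h1  # real effects that reach significance
    return false_positives / (false_positives + true_positives)


# Assumed for illustration: 80% power, 1 real effect per 10 nulls tested.
fpr_05 = false_positive_rate(0.05, 0.80, prior_odds_h1=0.1)
fpr_005 = false_positive_rate(0.005, 0.80, prior_odds_h1=0.1)
print(f"alpha = 0.05:  {fpr_05:.0%} of significant results are false")
print(f"alpha = 0.005: {fpr_005:.0%} of significant results are false")
```

With those inputs the false positive share falls from roughly 38% to roughly 6%. Different prior odds or lower power change the numbers, but the direction stays the same.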

I really don't think the proposal to do "fewer, but better" experiments works with animal studies. They are so expensive and so complicated and so much work and only answer singular, small questions that you almost always need a ton of further follow up work.

For instance, in the field I work in you have to spend days to months waiting for tumors to grow and then go and treat the animals every day for a couple of weeks with an IV drug (weekends too!). That is a lot of work and at the end only tells you one piece of information about the drug: does it slow tumor growth in this one experimental model. It may in fact do that -- and you may get a really great p-value if you increase the number of mice -- but you still need to study the drug's pharmacokinetics, tissue distribution, in vivo mechanism of action (assuming you already know the in vitro mechanism of action). These are not just optional experiments that we require today to publish: this kind of work is essential to presenting a story about a new drug. It's not just about what it does, but how it works and how universalizable it is.

This still doesn't feel satisfying. Part of me is still not really happy with the philosophical foundations of statistics. Does anyone know of any legitimately competing theory to statistics? Maybe something that doesn't rest on the same types of mathematics that Fisher and crew relied on when all this started? Pure mathematics has come a long way in the last fifty years but little has seeped into the applied world.

How & why exactly are you unhappy with statistics?

Uh oh. If you don't watch out you'll end up a Bayesian. http://www.stat.columbia.edu/~gelman/research/unpublished/p_...

Is this along the lines of what you were hoping to find? Here's more: http://andrewgelman.com/2016/12/13/bayesian-statistics-whats...

It's likely that the problems with statistics are inherently due to the nature of knowledge, so alternative formulations are not likely to help much.

Statistics arises from a set of axioms, assumed truths, which can be used to prove all other things in the field.

You can take a look at the three axioms people use to justify statistics. If you are willing to accept them, all else that relies on them (without using new axioms) must be true:


This same logic is used to justify development in pure mathematics: choose a set of axioms which you accept as ground truths, and prove things using them. As long as you are unable to prove your axioms are contradictory, and the axiom choice seems acceptable, then the work that you've done (with respect to them) is philosophically justified.

Statistics and probability are different things. I'm fine with the foundations of probability.

Just for reference, not everyone is OK with the foundations of probability - what you might call "conventional mathematical probability" as axiomatized by Kolmogorov. See http://www2.idsia.ch/cms/isipta-ecsqaru/ for the most recent in a series of workshops.

One entry into this set of ideas is what Peter Walley has called the "Bayesian dogma of precision" - that every event has a precise probability, that every outcome has a known cost. There are real-world situations when these probabilities cannot be assessed, or may not even exist; same for utilities.

Some examples are in betting and markets (asymmetric information, bounded rationality), and in complex simulation environments having so many parameters and encoded physics that the interpretation of their probabilistic predictions is unclear.
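For reference, the Kolmogorov axiomatization mentioned above is short enough to state in full. Given a sample space $\Omega$ and a σ-algebra of events $\mathcal{F}$ on it, a probability measure $P$ satisfies:

```latex
\begin{align*}
&\text{(1) Non-negativity:} && P(E) \ge 0 \quad \text{for all } E \in \mathcal{F}, \\
&\text{(2) Normalization:} && P(\Omega) = 1, \\
&\text{(3) Countable additivity:} && P\Bigl(\bigcup_{i=1}^{\infty} E_i\Bigr) = \sum_{i=1}^{\infty} P(E_i)
  \quad \text{for pairwise disjoint } E_1, E_2, \ldots \in \mathcal{F}.
\end{align*}
```

The imprecise-probability work relaxes the requirement that each event gets a single number $P(E)$, typically replacing it with lower and upper bounds.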

Please don't treat probability and statistics as one.

> Part of me is still not really happy with the philosophical foundations of statistics.

You mean you aren't happy with...probability?
