
Psychology Journal Bans Significance Testing - tokenadult
http://www.sciencebasedmedicine.org/psychology-journal-bans-significance-testing/
======
CountBayesie
>The type of analysis being banned is often called a frequentist analysis

I find that there is a trend of associating "bad statistics" with
"frequentist statistics", which isn't really fair. If you found a statistician
trained only in frequentist methods and asked their opinion on experiment
design in psychological research, they would likely be just as appalled as any
Bayesian.

I'm a big fan of Bayesian methods, but the solution of "we'll solve the
problem of misunderstanding p-values by removing them!" is itself a symptom of
misunderstanding p-values! The misunderstanding is the issue, not the p-value.

~~~
stdbrouw
The problem is that p-values are begging to be misunderstood, and in fact you
cannot use them as a decision-making procedure without "misinterpreting" them –
after all, you're deciding whether to accept the alternative hypothesis (really
a question about P(HA|D)) based on 1-P(D|H0), on the grounds that, while
they're not the same, they're proportional. (In that sense the p-value is like
the poor man's likelihood ratio.) There's nothing wrong with p-values as a
concept, but there's everything wrong with p-values in hypothesis testing. The
misunderstanding is baked in.

~~~
im3w1l
You can update your posterior based on the p-values yourself though. "Well
those eggheads may have disproved X, but X is just common sense, so I'm gonna
keep believing it anyway. U-until I see more studies confirming the finding I
mean."

------
stdbrouw
Banning p-values makes sense to me, as they force you to declare an effect as
either significant or not significant, rather than looking at the
preponderance of the evidence and building knowledge over multiple
experiments. It also leads us to focus on statistical uncertainty at the
expense of all of the other kinds of uncertainty researchers are faced with:
do I have the theory to back this up, am I actually measuring what I am trying
to measure, is this coherent with other findings in the field? I do think the
editors might be right when they say banning p-values will make the quality of
the research go up, not down.

But if you read the original editorial, up at
[http://www.tandfonline.com/doi/pdf/10.1080/01973533.2015.101...](http://www.tandfonline.com/doi/pdf/10.1080/01973533.2015.1012991),
you can see that they also reject confidence intervals and Bayesian reasoning
with uniform priors (which is really the same thing) without providing any
guidance at all on better procedures. I fear that will just lead readers to
try and guess the reliability of the data themselves, or worse, interpret the
sample statistics as numbers without any associated uncertainty.

So they're doing away with poor statistical procedures, but at what cost? It's
like that old joke: we've found a 100% reliable cure for cancer – bombing the
planet until everybody's dead.

~~~
nabla9
Methodological flexibility in statistical modeling has gone through the roof.
New tools and estimation methods make Bayesian methods easy to use: JAGS and
Stan, Hamiltonian Monte Carlo, variational Bayesian approximations,
expectation propagation and even probabilistic programming. (Note: Bayesian in
this context does not mean subjectivity; it means the ability to quantify
uncertainty in the results and increased flexibility in the modeling.)

New methods and tools are faster to use and give better results, but they
require more statistical knowledge. If the scientists applying these methods or
the peer reviewers can't understand the advances, it's all for nothing. Many
sciences are methodologically very conservative, to the extent that it holds
the science back.

How do you increase the statistical knowledge of a field so much that peer
reviewers and researchers can be expected to understand and use the new
methods, if they can't even be trusted with p-values?

------
rcthompson
I spent about an hour explaining p-values to a fellow graduate-level researcher
a few weeks ago. I pretty much just kept rephrasing the definition in slightly
different ways until the person finally got it. In undergrad, hypothesis
testing was more or less taught as "do this inscrutable calculation and if the
result is 0.05 or less, you win". The point is, in my experience, a lot of
people really don't get p-values, even graduate-level and post-graduate
scientists.

~~~
NamTaf
Let me replay it to you and see if I understand it, because I don't know if I
do:

You assume a null hypothesis that (usually) represents the status quo of no
influence between the theory and the data. You then collect data. The p value
then describes the probability of that data aligning with / being as a result
of the null hypothesis. In other words, a p <= 0.05 says that you have <5%
chance that the data came from the theory stated in the null hypothesis - that
is, you have a 95% confidence that you can reject the null hypothesis in
favour of your new theory.

Is that correct? I may have minced terms there because my stats training is
woefully inadequate, but I think I adequately conveyed the concept?

~~~
yummyfajitas
Not quite. The 5% represents the chance that _if the null hypothesis is true_
, you would draw data at least as extreme as the data you just saw in a
repeated experiment.
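
To make that definition concrete, here is a minimal simulation sketch; the
sample size, observed mean and null distribution are all made up purely for
illustration:

    # Made-up example: H0 says the true mean is 0 (sd 1), and our n=30
    # sample happened to have mean 0.5. The p-value is the fraction of
    # repeated experiments under H0 that look at least that extreme.
    import numpy as np

    rng = np.random.default_rng(0)
    n, observed_mean = 30, 0.5
    sims = rng.normal(0.0, 1.0, size=(100_000, n)).mean(axis=1)
    p_value = np.mean(np.abs(sims) >= abs(observed_mean))
    print(p_value)   # ~0.006 for these made-up numbers

Note that this number says nothing directly about how probable H0 itself is.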

Computing the probability that the data came from the theory stated in the
null hypothesis would require a (Bayesian) prior.

Also, Tloewald's reply is completely and inexorably wrong. Tloewald seems to
want a Bayesian answer, which frequentist statistics can't give you.

~~~
GhotiFish
could you explain the distinction between saying:

"Given the evidence, there is a >=5% probability of the null hypothesis being
true"

and

"There is a >=5% probability that if the null hypothesis were true, that your
data would be at least as extreme"

The only difference I see is how you avoided saying anything about the null
hypothesis, but I don't see how you can avoid saying anything about it.

If H0 were true, then the observed result would be unlikely; so how can
you not conclude that H0 is unlikely? What step are you missing other than
collecting a preponderance of evidence against it?

The article never enters into this distinction. It makes it clear that people
misinterpret evidence against the null hypothesis as evidence of the
alternative, which is a false dichotomy.

I am confused. I also have sympathy for Tloewald at this point.

~~~
yummyfajitas
Sure. Let H0 be the null hypothesis and D be the data you observed. The first
statement is P(H0|D) = 0.05. The second is P(D|H0) = 0.05.

The two quantities are related to each other via Bayes rule:

P(H0|D)=P(D|H0)P(H0)/P(D)

So indeed, as P(D|H0) goes down, so does P(H0|D). But if P(H0)/P(D) is
sufficiently large, you can easily have P(H0|D) high while P(D|H0) is low.

I too have sympathy for everyone confused by frequentist stats - they tend to
answer the exact opposite question that one really wants answered. In
contrast, Bayesian stats tend to answer the question that most people ask.
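
To put some entirely made-up numbers on that (a strong prior for H0, and data
that is also not too surprising under the alternative):

    # Hypothetical numbers, purely for illustration.
    p_h0 = 0.95              # prior P(H0)
    p_d_given_h0 = 0.05      # P(D|H0), "significant" by the usual convention
    p_d_given_h1 = 0.20      # P(D|H1)

    p_d = p_d_given_h0 * p_h0 + p_d_given_h1 * (1 - p_h0)
    p_h0_given_d = p_d_given_h0 * p_h0 / p_d
    print(p_h0_given_d)      # ~0.83: P(D|H0) is 0.05, yet H0 is still probably true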

~~~
GhotiFish
Could you clear up some notation for me?

What does P(D) mean?

I read that as the probability of the data being true.

edit: to clear up my meaning.

I mean, it makes sense to me to ask "What is the probability of getting this
data, given that the null hypothesis is true"

and "what is the probability of the null hypothesis being true, given this
data"

but I don't know how "this data" evaluates on its own. I can't picture that

does it mean, how authoritative is the data? Maybe that's it.

edit: OK never mind I kinda worked it out on my own.

~~~
yummyfajitas
P(D) is the probability of observing the data you just saw, due to either the
null or non-null hypothesis. It's a strictly Bayesian quantity, since it's
dependent on a prior. If your model has only a null and alternative
hypothesis, then:

    P(D) = P(D|H0)P(H0) + P(D|H1)P(H1)

------
phren0logy
There was a question yesterday about Evidence Based Medicine vs Science Based
Medicine. The SBM criticism of EBM is the over-reliance on Randomized
Controlled Trials that meet p=0.05, without looking at the prior probability
that a treatment would help.

For example, EBM would say that if you have an RCT that shows that a lucky
rabbit's foot works, then you have reasonable evidence to put that into
practice. The issue is that, as this article points out, even with a
statistically significant result for such research, there's no plausible
mechanism by which a rabbit's foot makes you lucky. Therefore the RCT is just
one part of the whole picture, and subject to a specific type of manipulation.

~~~
Alex3917
> The SBM criticism of EBM is the over-reliance on Randomized Controlled
> Trials that meet p=0.05, without looking at the prior probability that a
> treatment would help.

The strength of significance testing is that it purposely doesn't try to tell
you how likely something is to be true, only how likely the data you got was
the result of chance assuming the treatment is no better than placebo. You're
still taking the prior probability into account when trying to figure out the
truth, you're just not putting a number on it.

My concern with bayesian approaches is that, like with frequentist approaches,
the truth is still fundamentally unknowable, only now you're encouraged to put
a number on that and pretend that it's science. While bayesian approaches
totally make sense in trying to determine a patient's likelihood of having
some disease when there is already data available for the prevalence in a
population and the sensitivity and specificity of the tests, using bayesian
logic to weight clinical trials strikes me as being highly dubious.

It would be one thing if SBM actually developed a framework to give a weight
to each methodological feature of a trial, but so far I haven't seen much work
to build a functioning system. Though if you're really honest about all the
ways that you can have positive results without something actually being true,
it seems like almost no amount of research will ever have a significant effect
on the prior.

~~~
yummyfajitas
_You're still taking the prior probability into account when trying to figure
out the truth, you're just not putting a number on it._

This is a danger sign - you are doing the same things the Bayesians do, just
informally, less explicitly, and probably incorrectly.

The fact is that to make a good decision, eventually you need to compute a
single number. This is an elementary fact of topology:

[https://www.chrisstucchio.com/blog/2014/topology_of_decision...](https://www.chrisstucchio.com/blog/2014/topology_of_decisionmaking.html)

That number will be based on some unprovable assumptions. That's a fact of
Godel's incompleteness theorem, if nothing else. So given this, why is it
"dubious" to make those assumptions explicit and obvious?

~~~
algorias
> The fact is that to make a good decision, eventually you need to compute a
> single number.

Your linked blog post states that if you make a good decision, then there is a
process computing a single number which is equivalent to your process. This is
not equivalent to what you claim. As a matter of fact, it's the same kind of
confusion that exists around the p-value.

It's not the case that a process explicitly computing such a number
automatically makes good decisions, which is what you seem to claim
implicitly.

Also, Gödel has nothing whatsoever to do with this.

~~~
yummyfajitas
Eventually you need to compute a number which is either above or below your
go/no go threshold. That's the number I'm referring to.

I don't claim you can't arrive at it by some perfect heuristic. I merely claim
that you are better off being explicit about your assumptions and formalizing
your reasoning. That just makes mistakes more obvious, makes your strong
assumptions more clear, and makes it more likely that you will correctly
update your beliefs rather than incorrectly discounting/overvaluing evidence.

You are right about Gödel; it's a separate theorem I'm referring to, which
says you need unprovable axioms. I misremembered, sorry, wrote that before my
coffee.

~~~
algorias
I see where you're coming from, and I agree with you in large part, especially
about making your assumptions explicit.

However, I think it's important to notice that an explicit formula for your
thought processes can be difficult (computationally expensive) to find. Our
brains have evolved to use heuristics and "gut feelings" to make decisions,
and the approach you propose forces you to throw all that away and use the
much slower general purpose processing part of your brain to emulate those
processes. So there's a tradeoff there.

------
mdbco
The article is certainly correct that p-values and confidence intervals (or
confidence sets, in multi-dimensional contexts) are widely misunderstood, not
just in psychology or other social sciences, but in the hard sciences as well.
The problem is even worse when you look outside of academia at common
practices in more applied settings.

As suggested, a good approach is to take p-values not as conclusive or
decisive, but rather as a tool that must be supplemented by other statistics.
In particular, the article emphasizes Bayesian methods, which can certainly
provide additional information, but this approach can also be rather limited
when priors are not well-defined or are entirely unknown, which is
unfortunately often the case in many problem domains.

One potential question is how to determine the nature of the distinction
mentioned in the conclusion between "preliminary research" and "confirmatory
research", particularly in cases where statistics provide the primary
evidence, as in, e.g. psychology. Further studies in the same vein as the
preliminary research can certainly provide additional supporting statistical
evidence, but this doesn't escape the problem that all of the evidence is
probabilistic in nature. The key issue here is that since statistical
approaches can only give probabilistic evidence that a hypothesis is correct,
they strictly cannot tell you what is certainly true, so even
confirmatory research is quite open to falsification. So we wouldn't want the
label of "confirmatory research" to somehow suggest to the public the idea
that it is certainly correct.

~~~
esfandia
Is the cautious approach then to treat a p-value in the absence of priors on
the same level as a p-value in presence of unfavorable priors? When someone
tests positive for a cancer test, the priors are known (probability of cancer
in the general population is usually very low, and the false positive rate of
the test may be relatively high), and so usually that first test is merely an
indication that further tests are needed. So when you don't know the prior and
you observe a low p-value on something, isn't that just "preliminary research"
that needs to be further confirmed with other methods or at least the same
test but using other data?
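
With made-up numbers for that screening example (1% prevalence, 90%
sensitivity, 5% false-positive rate), the post-test probability works out
roughly like this:

    # Hypothetical screening numbers, for illustration only.
    prevalence = 0.01        # P(cancer) in the population
    sensitivity = 0.90       # P(positive | cancer)
    false_positive = 0.05    # P(positive | no cancer)

    p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)
    p_cancer_given_positive = sensitivity * prevalence / p_positive
    print(p_cancer_given_positive)   # ~0.15: a positive result is only a first hint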

~~~
mdbco
_> Is the cautious approach then to treat a p-value in the absence of priors
on the same level as a p-value in presence of unfavorable priors?_

In the presence of a poor prior the Bayesian probability would be biased in
some way, so frequentists would say that the p-value in the absence of priors
is actually superior in this case. Bayesians would reply that if they thought
the prior might be poor then they would simply consider multiple different
priors, but it's not clear how this would improve things much over the
frequentist approach that simply assumes that the prior is unknown.

_> So when you don't know the prior and you observe a low p-value on
something, isn't that just "preliminary research" that needs to be further
confirmed with other methods or at least the same test but using other data?_

Yes, when you observe a p-value with low significance it should definitely
indicate to you that more testing is necessary, either by using different
testing methods, gathering new samples, or even just increasing the original
sample size if that's possible. What I was trying to suggest in my last
paragraph was that this should be the case even when we have highly
significant p-values, because even significant p-values are not decisive. So
even when we have "confirmatory research" that is highly statistically
significant, we should still do all of the things that we would do when we
have a p-value with low significance. It is sometimes the case that this
subsequent research will overturn even very highly statistically significant
results (though often this is unfortunately because mistakes in the original
statistical methodology are uncovered).

------
MollyR
I might be missing the forest for the trees.

The paper talks about how it seems researchers are "hacking" (their word)
p-values. If researchers lack the ethics to use one form of statistics
properly, what is really stopping them from misusing Bayesian analysis?

Sidenote: While we are talking about bayesian stuff. I recently ran into the
sleeping beauty problem (
[http://en.wikipedia.org/wiki/Sleeping_Beauty_problem](http://en.wikipedia.org/wiki/Sleeping_Beauty_problem)
) and it showed how different interpretations of an experiment can lead to
people believing in two very different answers in an almost religious way.
Some people argued this thought experiment shows a clear flaw in bayesian
thinking. I'm not sure either way, I'm still thinking about the puzzle.
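
A quick simulation doesn't settle the argument, but it does show that halfers
and thirders are counting different things (a sketch of the standard setup:
heads gets one awakening, tails gets two):

    import numpy as np

    rng = np.random.default_rng(0)
    flips = rng.integers(0, 2, size=100_000)           # 1 = heads, 0 = tails

    print(flips.mean())                                # fraction of flips that are heads: ~1/2
    heads_awakenings = flips.sum()                     # one awakening per heads flip
    tails_awakenings = 2 * (len(flips) - flips.sum())  # two awakenings per tails flip
    print(heads_awakenings / (heads_awakenings + tails_awakenings))   # ~1/3

Whether Beauty's credence should track the per-flip fraction or the
per-awakening fraction is exactly what the two camps disagree about.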

~~~
Gravityloss
That's a cool problem. It took a while to understand the angle.

I'd say I'm a halfer. I think the extreme sleeping beauty problem highlights
how, if you're woken up, it's no more probable that it's a heads awakening than
a tails awakening, since, for a reason I can't explain, the million tails
awakenings "don't accumulate".

In a sense, it's like the opposite of the Monty Hall problem, since here the
sleeping beauty receives no information whatsoever during the experiment.

~~~
MollyR
I'd say I'm a halfer too, but a lot of physicists/philosophers say 1/3, and it
really bothers me that I can't seem to sync up with world leaders.

source : [http://rfcwalters.blogspot.com/2014/08/the-sleeping-
beauty-p...](http://rfcwalters.blogspot.com/2014/08/the-sleeping-beauty-
problem-how.html)

------
cafebeen
Here's the key passage:

> BASP will require strong descriptive statistics, including effect sizes. We
> also encourage the presentation of frequency or distributional data when this
> is feasible. Finally, we encourage the use of larger sample sizes than is
> typical in much psychology research, because as the sample size increases,
> descriptive statistics become increasingly stable and sampling error is less
> of a problem.

In other words: report the effect size, plot the data, and increase the N.
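
A minimal sketch of that recommendation, with made-up data: report a
descriptive effect size (Cohen's d here) alongside whatever test you run:

    # Made-up control/treatment samples, for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    control = rng.normal(0.0, 1.0, size=200)
    treatment = rng.normal(0.3, 1.0, size=200)

    pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    cohens_d = (treatment.mean() - control.mean()) / pooled_sd
    print(cohens_d)   # size of the effect in standard-deviation units, not a verdict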

------
MengerSponge
Good! This is a big step in improving everybody's work. If you aren't up on
this research, here are the two articles that you need to read.

Why Most Published Research Findings are False:
[http://journals.plos.org/plosmedicine/article?id=10.1371/jou...](http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124)

Revised Standards for Statistical Evidence:
[http://www.pnas.org/content/110/48/19313.short](http://www.pnas.org/content/110/48/19313.short)

------
Kenji
The problem is this: you need a solid background in statistics and some
mathematics to perform these tests properly, and apparently most people in that
field lack those skills. If someone can't hit a nail on the head with a hammer,
blaming and banning the use of hammers won't solve a thing.

~~~
manicdee
Taking hammers away will solve the problem of holes in things which aren't
supposed to have them, including thumbs, walls, people's heads, and candy
jars.

~~~
zmjones
There are just other tools that will get abused; Bayes factors, for example.

------
po
The audio is a bit quiet but the 'dance of the p values' video [1] linked to
by the article is absolutely fantastic and well worth watching.

[1]
[https://www.youtube.com/watch?v=ez4DgdurRPg](https://www.youtube.com/watch?v=ez4DgdurRPg)

------
chrisseaton
I never understood where 0.05 came from. It seems like an arbitrary, magic
number. Aren't those bad in science? Shouldn't we have some reason for every
number we use? Why 0.05 instead of 0.049?

~~~
nzp
It is somewhat arbitrary. 0.05 corresponds to roughly 2, and 0.01 to roughly
2.6 standard deviations under the normal distribution. In high energy physics,
for example, in the past a result with 3 sigma significance was considered “a
discovery”, but it turned out there were a lot of false signals, so the bar was
raised to 5 sigma (a ~3 x 10^-7 chance of getting an effect at least that large
assuming there is no effect in reality), while a 3 (or 4) sigma result is
considered “evidence of an effect”. Unfortunately, in the softer sciences,
considering their generally poor state of research standards, we have no idea
whether the 0.05 or 0.01 levels really are good standards of “discovery”.
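
For reference, the rough sigma-to-p correspondence is easy to check (two-sided
tail areas under a normal; particle physics usually quotes the one-sided figure
for 5 sigma):

    from scipy.stats import norm

    for k in (2, 3, 5):
        print(k, 2 * norm.sf(k))   # two-sided tail probability at k sigma
    # 2 -> ~0.046, 3 -> ~0.0027, 5 -> ~5.7e-07 (one-sided 5 sigma: ~2.9e-07)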

------
Booktrope
If the stats are based on samples of real-world data rather than very
rigorously designed experimental comparisons, I think there's another reason
to be wary of p-values. [Note: in what follows I probably use some terminology
wrong, because I'm not a statistician, but I do think the point is important
and I don't see much written about it.] In the real world data is not a bunch
of independent events, but events (or data points) that are interconnected in
ways that are difficult to quantify. I once heard a presentation by an expert
in AB testing at a leading tech company who became wary of his results and
brought his concerns about non-independence of events that were being tested
to corporate statisticians. (He was concerned that, even though the AB testing
procedure supposedly randomized the test, interconnectedness within website
traffic was not being accounted for.) By his account, the statisticians agreed
it was a problem but recommended that he assume that variations due to non-
independence would more or less balance each other out. He wasn't satisfied
with this and said that he went ahead and did some more measurements, then ran
out all the binomial expansions rather than relying on approximations. When he
did this more detailed work, he found out that with at least some web-based AB
tests, where conventional statistical formulas showed a p-value of .05, he was
getting a confidence level of more like 30%. (I don't think you could call his
measurement a p-value because he wasn't using the formulas normally used to
compute p-values, but I think what he was saying was more or less that the
p-value from the formulae was .05 when it should have been .30 from a more
rigorous look)

As I'm not a statistician by trade I don't keep up on the literature very
well, but interconnectedness of data does seem to me to be a very important
issue. I'm wondering if anyone can point me to some helpful reading to
understand this side of the issue better. In particular, is there any approach
to AB testing that can reliably address the issue of data interconnectedness
in the kind of situation described above?
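
I can't speak to the presenter's actual setup, but here is a hedged sketch of
the general phenomenon: if visits arrive in correlated bursts that effectively
land in one variant together, and you then run an ordinary two-sample test as
if every visit were independent, the nominal 5% false-positive rate can be
badly off.

    # Sketch with made-up numbers: there is no real A/B difference, so a
    # well-calibrated test should reject at the 0.05 level about 5% of the time.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    runs, false_positives = 2000, 0
    for _ in range(runs):
        # 40 bursts of 25 visits; each burst shares a random "burst effect"
        # and, because it arrives together, lands entirely in one variant.
        burst_effect = rng.normal(0, 1, size=40).repeat(25)
        y = burst_effect + rng.normal(0, 1, size=40 * 25)
        arm = rng.integers(0, 2, size=40).repeat(25)      # burst-level assignment
        p = ttest_ind(y[arm == 0], y[arm == 1]).pvalue    # visit-level test
        false_positives += p < 0.05
    print(false_positives / runs)   # far above 0.05 when the bursts are ignored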

~~~
stdbrouw
Could you expand on what you mean by "interconnectedness within website
traffic"?

If one person visiting the site has no influence on other people visiting the
site, then measurements of their behavior will be independent. If Facebook
tests a different interface on half of their users and the changed behavior of
those users indirectly has some impact on the behavior of the control group,
then your measurements would have some level of dependence – I can imagine
that this could happen but it's not clear to me that this would be a common
scenario. The same would happen if you measure the behavior of the same person
more than once – but in this case there are many procedures for working with
paired or autocorrelated data.

~~~
Booktrope
The presenter I mentioned did not go into details about what
interconnectedness he found, but I think it's quite obvious that people
visiting a site do have influence on other visitors, which is at least a part
of the underlying issue. On the simplest level, most web sites have share
buttons to make it as easy as possible for visitors to influence other
traffic. Or, other examples, a trending tweet can massively influence patterns
of usage of a website or Facebook page (or many tweets with small reach can
cause many small influences); or an RSS feed might influence patterns of
tweeting or posting elsewhere. I think there are myriad other interconnections
within web traffic. We do a great deal of work to drive traffic that is based
on the premise that different visitors to websites are mostly interconnected.
It's these factors that give me pause when I think about statistical measures
that are premised on an assumption that we're measuring independent events.

~~~
stdbrouw
Hmmm, again, there are certainly ways that interactions between visitors can
cause statistical dependence, but not in the specific case you mention. Let's
take an A/B test on a referral funnel. If a user invites all of his friends,
and his friends then visit the site, they will be randomized over A and B just
like the original user, and so any effect that is not due to changes in the
referral experience will simply not matter because it will contribute equally
to both groups.

Without better examples it's very hard to judge whether this is a real
problem.

~~~
Booktrope
I understand if you think this is a non-issue, though I don't agree. The
speaker I referenced asked the statisticians at his company about this, and
they said it was a non-issue because things balanced out. He thought that
was an idealization and claimed to have tested it by building in some
real-world data, and reported that interconnected data of this kind drastically
affected
confidence levels. He didn't get into the details of how he measured
interconnectedness, however.

The example you give seems to me to oversimplify the issue of complex
interconnections between data points, as if the traffic on a real website came
from one set of referrals, while in reality it's much more complex, with
referrers inducing other referrers and a variety of campaigns, postings, etc.
influencing each other, and over time, overlaid in a fairly complex pattern.
In other words, a bunch of interrelated data, very little of which is actually
independent of other items.

I'm not really asking for an explanation of this in the comment thread here;
what I'd like to know is whether there are any studies or other publications
that deal with the issue of how to evaluate tests run on interconnected data of
this kind.

~~~
stdbrouw
There are absolutely ways to deal with what you call interconnected data, as I
mentioned earlier: paired tests, corrections for autocorrelation,
nonparametric and bootstrap methods for non-normal data and so on. But barring
any examples of what you mean by interconnectedness in this context, it's
hard to recommend any studies or publications because there is no One Method
Of Interconnectedness Correction.

Also, statistics deals with many idealizations but the idea that randomization
allows you to cleanly measure the effect of an intervention in the face of
what would otherwise be confounding is simply not one of them. Sorry to
disappoint, but with all you're telling us it simply sounds like the speaker
was clueless.

~~~
Booktrope
Well, if he was clueless then two very large and successful tech companies had
a clueless guy running their AB testing and showing great results in each
context.

I'm certainly not looking for "One Method for Interconnectedness Correction"
(especially not, as you put it, with each word capitalized). I'm looking for
studies or papers that might have addressed anything like the effect of
interconnectedness of web data on AB testing. I think you're saying, you don't
know of any, and also that you personally don't think it's a real issue.

------
jwilliams
I think this Wikipedia section summed it up really well for me:
[http://en.wikipedia.org/wiki/P-value#Alternating_coin_flips](http://en.wikipedia.org/wiki/P-value#Alternating_coin_flips)

Extreme example, but explains how the context can shift the findings
significantly.
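
A sketch in the same spirit (not a transcription of the Wikipedia example): the
very same sequence of flips gets a completely different p-value depending on
which test statistic you chose to look at.

    from scipy.stats import binom

    seq = "HTHTHTHTHTHTHTHTHTHT"                  # 20 perfectly alternating flips
    heads = seq.count("H")                        # 10 heads out of 20
    alternations = sum(a != b for a, b in zip(seq, seq[1:]))   # 19 of 19 pairs alternate

    # Statistic 1: number of heads, Binomial(20, 0.5) under a fair coin.
    p_heads = binom.sf(heads - 1, 20, 0.5)        # ~0.59: nothing to see here
    # Statistic 2: number of alternations, Binomial(19, 0.5) under a fair coin.
    p_alt = binom.sf(alternations - 1, 19, 0.5)   # ~2e-06: wildly "significant"
    print(p_heads, p_alt)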

------
VieElm
I kind of which that the widely read publications, like newspapers, that have
turned out science articles based on all these now pretty much discredited
research articles would issue retractions or at least clarification for each
of the articles that has been challenged in this regard.

------
kriro
This is very interesting for me. I recently switched from AI to Human-Computer
Interaction, which is more cross-disciplinary, with a particularly strong
influence from psychology. I'll gladly admit that I don't have the best
understanding of NHST, but I think I understand it well enough.

Interestingly enough there was a post on HN somewhat recently about Bayesian
alternatives (BEST). The paper that was linked was:
ftp://ftp.sunet.se/pub/lang/CRAN/web/packages/BEST/vignettes/BEST.pdf

And the recommended book I settled on was by the author of that paper (Doing
Bayesian Data Analysis)

I feel like I'm "ahead of the curve" thanks to HN :D

------
analog31
I wonder if it just boils down to this: Exclusive reliance on any single tool
by an entire field for a long enough time period will eventually lead to a
proliferation of bad results.

~~~
rcthompson
More like exclusive reliance on a single tool will lead to a requirement for
everybody to use it, even if they don't understand how to use it properly,
which leads to both unintentional misuse (by experimenters who don't
understand it) and the inability to catch intentional misuse (because
adjudicators don't understand it either).

------
torgoguys
Obligatory xkcd: [http://xkcd.com/882/](http://xkcd.com/882/)

------
return0
I think this is also important on a conceptual level. There's just too much
literature in the life sciences nowadays which, instead of formulating bold
hypotheses with clearly distinct effects, is happy to report minuscule effects
that just crossed the arbitrary significance barrier.

------
logicallee
the main reason to ban significance testing in all fields is because of
logicallee's getcher scientific results agency.

Our prices are:

$10,000 random study, no results guaranteed. Not recommended! Highly likely to
be damaging.

$20,000 basic study, p=0.5 - study inconclusive but implies it's at least not
"more likely" that the damaging/negative result is correct. (Assuming uniform
priors.) No scientific value.

$100,000 weak FUD. Suggests that the damaging/negative results (assuming
bayesian reasoning or uniform priors) may be incorrect, at a suggestive p<0.10
level. Not conclusive and invites further studies which can strengthen
damaging/negative result! Not recommended unless further studies are unlikely.
Consists of ten studies, one of which is published with remaining buried.
Unscientific.

$200,000 basic refutation. Refutes the damaging/negative result at a
statistically significant p<0.05. Likely to be referenced and accepted.
However, due to significance level, links to logicallee's getcher scientific
results agency should still be avoided! (Invites skepticism.) Scientific. May
take up to six months.

$500,000 Silver refutation. Highly significant refutation at p<0.02. Study can
be extremely rigorous and sponsorship can be public. The results should be
referred to and referenced as widely as possible. Unpublished results (the
other 49 studies in the series) to remain unpublished and unreferenced. Due to
the number of parallel studies to be involved, may take up to one year for
these results. Highly scientific.

$1,000,000 Gold refutation. Highly significant refutation at p<0.01, with
extreme amounts of data to be published. Should become the gold standard for
data in the field, and the most significant effect published. Full refutation
of damaging/negative result, with full scientific rigor. Should be widely
promoted. May take up to 24 months to produce data. Gold standard of science.

ACADEMIC BONUS: prices are free for tenured professors, thousands of whom can
do all the studies they want and only publish if they see some significant
effect.

EDIT:

in other words, [http://xkcd.com/882/](http://xkcd.com/882/)

