
Scientists Perturbed by Loss of Stat Tool to Sift Research Fudge from Fact - jonbaer
http://www.scientificamerican.com/article/scientists-perturbed-by-loss-of-stat-tool-to-sift-research-fudge-from-fact/
======
nkurz
Here's the editorial by David Trafimow and Michael Marks explaining the new
policy for their journal "Basic and Applied Social Psychology":
[http://www.tandfonline.com/doi/pdf/10.1080/01973533.2015.101...](http://www.tandfonline.com/doi/pdf/10.1080/01973533.2015.1012991)

And here's their concluding paragraph explaining their rationale and
objective:

    
    
      We conclude with one last thought. Some might view
      the NHSTP[1] ban as indicating that it will be easier to
      publish in BASP, or that less rigorous manuscripts will
      be acceptable. This is not so. On the contrary, we   
      believe that the p < .05 bar is too easy to pass and sometimes
      serves as an excuse for lower quality research. We hope
      and anticipate that banning the NHSTP will have the
      effect of increasing the quality of submitted manuscripts
      by liberating authors from the stultified structure of
      NHSTP thinking thereby eliminating an important
      obstacle to creative thinking. The NHSTP has dominated
      psychology for decades; we hope that by instituting
      the first NHSTP ban, we demonstrate that
      psychology does not need the crutch of the NHSTP,
      and that other journals follow suit.
    

[1] NHSTP = null hypothesis significance testing procedure

~~~
protonfish
And what should be used as an alternative? There is no reason to believe that
banning p-values would result in improved research quality. Desperate
researchers will just find another technique to game.

~~~
yummyfajitas
Instead of blindly applying a tool you don't understand, you'll need to build
a statistical model, explain why it's valid, and then construct a meaningful
measurement based on it.

Then the referee and reader will be required to understand it and will have
the ability to critique it.
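
For instance (just an illustration, not anything the journal prescribes), instead of a bare "p < .05" an author might report the estimated effect together with an uncertainty interval, something like:

    # Rough sketch (illustrative only): report an effect size with a
    # bootstrap confidence interval instead of a bare "p < .05".
    import numpy as np

    rng = np.random.default_rng(0)
    control = rng.normal(loc=0.0, scale=1.0, size=50)   # made-up data
    treated = rng.normal(loc=0.4, scale=1.0, size=50)   # made-up data

    def mean_diff(a, b):
        return b.mean() - a.mean()

    # 10,000 bootstrap resamples of the mean difference
    boots = np.array([
        mean_diff(rng.choice(control, control.size, replace=True),
                  rng.choice(treated, treated.size, replace=True))
        for _ in range(10_000)
    ])
    lo, hi = np.percentile(boots, [2.5, 97.5])
    print(f"effect estimate: {mean_diff(control, treated):+.2f}, "
          f"95% bootstrap CI: [{lo:+.2f}, {hi:+.2f}]")

The point is that the referee then has to judge whether the model and the interval are meaningful, rather than just checking a single threshold.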

~~~
disgruntledphd2
Which is good, as it will discourage people from submitting there, which means
more potential papers for those of us who got on this train a while ago.

------
thecopy
I think statistics is one of the most misunderstood fields of mathematics
relative to how much the average person believes he or she knows about it. I
know this article was about researchers and not the general public, but I can
definitely sympathize with them; I have an MSc in engineering physics and I
still have to think three times about a number before I know that all the
assumptions I made while calculating it were correct and unbiased, and three
more times about what the number means and what conclusions I can actually
draw from it.

~~~
CamperBob2
Absolutely. If I were in charge of curriculum development for engineers, I'd
swap out a couple of semesters of math that the student will never see again
unless they end up working on EM simulators, and replace it with statistics
course(s) that will be helpful throughout their careers.

Unfortunately there's a lot of coursework in engineering that amounts to
institutionalized hazing. The professor had to do it, so by the hallowed beard
of Frobenius, the students have to do it too...

~~~
CHY872
At one top British university, the first year Physics practicals are
universally reviled, everyone gets within a few marks of each other, the
experiments are trivial (roll a ball down a slope!), and at every meeting the
academics all agree that they'd prefer that they were dropped. Unfortunately
there's some kind of government requirement for practical work, so they stay.

~~~
trhway
>the experiments are trivial (roll a ball down a slope!)

they should replace it with trivial experiments of the 21st century - photon
counting for Bell-inequality testing.

Wrt. the original article - good riddance, one less orthodoxy in science. It
isn't about the tool itself - p-values in this case - it's about orthodoxy,
which is the main enemy of science.

~~~
megablast
Oh exactly, it sounds like there are ways of getting around the government
mandate.

------
captainmuon
I find this whole backlash against p-values pretty confusing. That is probably
because I come from particle physics, where we also use a lot of statistics,
but in subtly different ways.

Hypothesis testing is not too hard [1]. You pick a cutoff, say p < 0.003 ("3
sigma"), and then if your p-value is below that, you call it evidence for your
signal - otherwise you just don't have evidence. By doing so, the probability
of getting data this signal-like or more, assuming there is no signal, is
0.3%. In other words, if you follow this prescription and are looking for
something that isn't there, in 0.3% of cases you will (wrongly) claim evidence
(a Type I error, or "error of the first kind").

Since we are a cautious bunch, we actually put the threshold for discovery at
5 sigma - p<0.0000003 - which sometimes gets us ridiculed by statisticians.
This hyper-strict standard shouldn't be necessary, but in part it's a hedge
against the case where you get your systematic errors wrong (you believe your
prediction for the null hypothesis is more accurate than it is - so if you see
a slight fluctuation, it seems to be many (wrong) standard deviations away).
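
For concreteness, here's the sigma-to-p conversion (standard normal tails; whether you quote one- or two-sided numbers is a convention, so treat the exact decimals loosely):

    # Convert "N sigma" to a tail probability of the standard normal.
    from scipy.stats import norm

    for n_sigma in (3, 5):
        p = norm.sf(n_sigma)   # upper-tail probability = 1 - CDF
        print(f"{n_sigma} sigma -> one-sided p = {p:.2e}")
    # 3 sigma -> one-sided p = 1.35e-03  (two-sided ~0.0027; the "p < 0.003"
    #                                     above looks like that, rounded)
    # 5 sigma -> one-sided p = 2.87e-07  (the 0.0000003 discovery threshold)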

One other thing that we have to take into account - and many people forget
this - is the look-elsewhere effect. If you perform one search, looking for
e.g. a Higgs boson with a mass of 126 GeV, you expect N events in your
experiment if it is not there, and N+X if it is there. You know how N is
distributed, and the interpretation is straightforward. However, if you
perform a scan, looking at 120, 121, 122, 123... GeV, then you have to adjust
your p-value, since you are basically performing a bunch of different
experiments, and by chance alone some of them are bound to turn up
"significant".
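
As a back-of-the-envelope illustration of why the adjustment is needed (pretending, for simplicity, that the scan points are independent, which they aren't in a real mass scan):

    # Chance of at least one false "3 sigma" excess among m independent
    # tests, each at per-test threshold alpha (toy model of a scan).
    alpha = 0.00135                       # one-sided 3 sigma
    for m in (1, 10, 100, 1000):
        p_any = 1 - (1 - alpha) ** m      # P(at least one false excess)
        print(f"{m:5d} tests -> P(any false 3-sigma excess) = {p_any:.3f}")
    # A Sidak-style correction keeps the family-wise rate at alpha by
    # using a per-test threshold of 1 - (1 - alpha) ** (1 / m).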

The same thing applies when hundreds or thousands of master's and PhD students
and postdocs do their analyses - even if no one makes a mistake, some of them
will "find" a 3 sigma or larger effect that isn't there, just due to the sheer
number of independent statistical tests performed. I've "found" new particles
myself this way, but when you keep calm, put the result into context with
other analyses, and add more data, you'll often find that it melts away.
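
You can simulate that effect directly; here's a rough sketch (pure toy data, independent analyses) of how many null analyses cross 3 sigma just by chance:

    # Toy Monte Carlo: 1000 independent "analyses", each testing pure
    # background (no real effect); count how many cross 3 sigma anyway.
    import numpy as np

    rng = np.random.default_rng(42)
    n_analyses, n_per_analysis = 1000, 100
    false_alarms = 0
    for _ in range(n_analyses):
        data = rng.normal(0.0, 1.0, n_per_analysis)   # null: no signal
        z = data.mean() / (data.std(ddof=1) / np.sqrt(n_per_analysis))
        if z > 3:                                     # beyond 3 sigma
            false_alarms += 1
    print(f"{false_alarms} of {n_analyses} null analyses 'found' a signal")
    # Expected count is roughly n_analyses * 0.00135, i.e. one or two.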

\------------

[1] explaining it is hard, and I will undoubtedly have messed up, especially
since I'm tired.

~~~
megablast
> which sometimes gets us ridiculed by statisticians

You get laughed at because you have the huge luxury of dissecting huge
amounts of data and running the experiment billions of times.

Do you understand that in most disciplines, you don't have that luxury?

~~~
hudibras
To begin with, you'll have to explain to the physicists that there are other
disciplines besides physics.

------
amateurpolymath
There has been plenty of debate about this in other fields as well. Deirdre
McCloskey and Stephen Ziliak have a particularly well-written paper titled
"The Cult of Statistical Significance" on this very topic. Their main point is
that statistical significance is meaningless without a discussion of
magnitudes.

[1]
[http://www.deirdremccloskey.com/docs/jsm.pdf](http://www.deirdremccloskey.com/docs/jsm.pdf)
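
A quick illustration of the magnitude point (numbers entirely made up): with a big enough sample, an effect far too small to matter in practice still sails under p < .05.

    # Made-up example: a practically negligible difference becomes
    # "statistically significant" once the sample is large enough.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 1_000_000
    a = rng.normal(100.0, 15.0, n)     # control group
    b = rng.normal(100.1, 15.0, n)     # treatment: +0.1 on a 100-point scale
    t, p = stats.ttest_ind(a, b)
    print(f"mean difference ~ {b.mean() - a.mean():.2f}, p = {p:.2g}")
    # p will typically come out tiny here, yet a 0.1-point shift against a
    # standard deviation of 15 may be of no practical importance at all.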

------
Perceval
Looks like this may be Prof. Leek's course mentioned at the end of the
article:
[https://www.coursera.org/course/statinference](https://www.coursera.org/course/statinference)

Also there was previous HN post on p-values, which I found really interesting:
[https://news.ycombinator.com/item?id=9330076](https://news.ycombinator.com/item?id=9330076)

Also this _Nautilus_ article: [http://nautil.us/issue/4/the-unlikely/sciences-significant-s...](http://nautil.us/issue/4/the-unlikely/sciences-significant-stats-problem)

------
dthal
>>Several journals are trying a new approach...in which researchers publicly
“preregister” all their study analysis plans in advance. This gives them less
wiggle room to engage in the sort of unconscious—or even deliberate—p-hacking
that happens when researchers change their analyses in midstream to yield
results that are more statistically significant than they would be otherwise.
In exchange, researchers get priority for publishing the results of these
preregistered studies—even if they end up with a p-value that falls short of
the normal publishable standard.

It's not exactly the same issue as the one addressed by banning p-values, but
this would help a lot.

------
visos
This is a bad idea. Sure, the p-test is pretty flawed, but this is like going
without an antivirus because the one you have has bad detection rates.

The researchers are not the only ones who could game the system. A bigger
problem is the editorial staff. Replacing an objective test, however bad, with
a nonspecific 'case-by-case' criterion opens the door for nepotism and
political agenda pushing. Psychology is an especially dangerous field for
this, with the potential to label entire groups of people with opposing views
as mentally ill.

The cynic in me sees this as a power-grab.

What they should have done is specify Bayesianism as the new test, period.
None of this case-by-case BS.

------
CHY872
Interesting reading:

54% of findings with p < 0.05 not statistically significant:
[http://www.dcscience.net/Schuemie-Madigan-2012.pdf](http://www.dcscience.net/Schuemie-Madigan-2012.pdf)

Easy stats paper as to why:
[http://www.stats.org.uk/statistical-inference/Lenhard2006.pd...](http://www.stats.org.uk/statistical-inference/Lenhard2006.pdf)

------
n00b101
I think it's easier to explain this in terms of likelihood theory. Likelihood
is the probability of observed data GIVEN A SPECIFIC MODEL. This is NOT to be
confused with the probability of a specific model being correct GIVEN THE
OBSERVED DATA. It is the latter probability that people really want to know,
but confusing it with the former probability can have catastrophic
consequences in fields like medicine, engineering, jurisprudence, finance and
insurance.

The problem is related to the Prosecutor's Fallacy
([https://en.wikipedia.org/wiki/Prosecutor%27s_fallacy](https://en.wikipedia.org/wiki/Prosecutor%27s_fallacy)):
"Consider this case: a lottery winner is accused of cheating, based on the
improbability of winning. At the trial, the prosecutor calculates the (very
small) probability of winning the lottery without cheating and argues that
this is the chance of innocence. The logical flaw is that the prosecutor has
failed to account for the large number of people who play the lottery."
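
A toy Bayes calculation (all numbers invented) makes the gap between the two probabilities concrete: even if an honest win is a one-in-ten-million event, the sheer number of honest players can make innocence by far the more likely explanation of a win.

    # Toy numbers, purely illustrative.
    p_win_given_honest = 1e-7      # chance a given honest ticket wins
    p_win_given_cheat  = 1.0       # assume a cheater always wins
    p_cheat            = 1e-9      # prior: fraction of players who cheat

    # Bayes' theorem: P(cheat | win)
    p_win = (p_win_given_cheat * p_cheat
             + p_win_given_honest * (1 - p_cheat))
    p_cheat_given_win = p_win_given_cheat * p_cheat / p_win
    print(f"P(win | honest) = {p_win_given_honest:.1e}")
    print(f"P(cheat | win)  = {p_cheat_given_win:.3f}")
    # With these numbers the winner is still ~99% likely to be honest,
    # even though an honest win had probability 1 in 10 million.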

There is a mathematical statistics professor at the University of Toronto
named D.A.S. Fraser
([http://www.utstat.utoronto.ca/dfraser/](http://www.utstat.utoronto.ca/dfraser/))
who is an expert in likelihood theory and has commented on this issue:
"...statistics does have the answer! The answer is contained in the p-value
function from likelihood theory: p(delta). Here delta is the relevant parameter
with delta_0 as the null value and delta_1 as the alternative needing
detection. Then p(delta_0) is the observed p-value, p(delta_1) is the
detection probability, and the rest is judgement: the route to the Higgs
boson."
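
For the simplest textbook case (an observed mean from n draws with known spread) the p-value function is easy to write down. This is only my sketch of that special case, using one tail convention among several, not Fraser's general higher-order machinery:

    # Sketch of a p-value (significance) function for a normal mean:
    # p(delta) = P(observing a sample mean at least as large as ours,
    #              if the true mean were delta).
    import numpy as np
    from scipy.stats import norm

    xbar_obs, sigma, n = 0.5, 1.0, 25
    se = sigma / np.sqrt(n)

    def p_value_function(delta):
        return norm.sf((xbar_obs - delta) / se)

    delta_0, delta_1 = 0.0, 1.0   # null value and alternative of interest
    print(f"p(delta_0) = {p_value_function(delta_0):.4f}")  # observed p-value
    print(f"p(delta_1) = {p_value_function(delta_1):.4f}")  # detection side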

------
ghshephard
A few obligatory references:

[https://xkcd.com/1478/](https://xkcd.com/1478/), along with the awesome
explain-xkcd: [http://www.explainxkcd.com/wiki/index.php/1478:_P-Values](http://www.explainxkcd.com/wiki/index.php/1478:_P-Values)

[https://xkcd.com/892/](https://xkcd.com/892/)

And my favorite:

[https://xkcd.com/882/](https://xkcd.com/882/)

------
po
The video that made it all come together for me was this "Dance of the
P-Values" video:

[https://www.youtube.com/watch?v=5OL1RqHrZQ8](https://www.youtube.com/watch?v=5OL1RqHrZQ8)

It does a great job of showing how the same experiment can yield vastly
different p-values, and why a single p-value isn't well suited to the task
it's been given.
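
You can reproduce the effect with a few lines (simulated data, not the video's): the same modest-powered experiment, repeated, gives p-values all over the place.

    # Simulate the "dance": the identical experiment (true effect 0.5 sd,
    # n = 30 per group) repeated 20 times yields wildly varying p-values.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    for i in range(20):
        a = rng.normal(0.0, 1.0, 30)
        b = rng.normal(0.5, 1.0, 30)
        _, p = stats.ttest_ind(a, b)
        print(f"replication {i+1:2d}: p = {p:.3f}")
    # With this effect size and sample size the power is only ~0.5, so
    # roughly half the replications land under 0.05 and the rest don't,
    # even though the underlying effect never changes.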

------
allworknoplay
Their issue is basically with the lack of negative-result reporting, which is
what makes p-values useless in practice; it seems very odd to vilify a
valuable tool when banning it will never solve the real problem: people
generally re-run experiments until they stumble upon a publishable metric.

------
analog31
Despite the present-day predominance of the life sciences (including
psychology), I wish articles would not describe the p-value crisis as a crisis
of "science". Physical scientists rarely use p-values. There are lots of other
ways to establish the robustness of a result, most importantly, _not relying
on just one tool_.

------
nodata
A very well-written article.

