
A surge of p-values between 0.040 and 0.049 in recent decades [pdf] - ggreer
https://peerj.com/preprints/447v3.pdf
======
Malarkey73
The idea that you get a lot of just-marginally-significant results due to
"p-hacking", i.e. minor fiddling such as excluding outliers or picking one
hypothesis test over another, is probably true. Then there is positive
publication bias, i.e. only publishing positive findings.

However, looking only in the abstract doesn't really address either of these,
as even absent p-hacking or positive bias you would still expect the
abstract to contain the selected highlights (i.e. positive findings) from the
paper. It is the one part of the paper where you really should expect positive
selection bias!

If the paper has lots of negative tests (and most biological papers will
report lots of negative-control p-values), these aren't picked up. A better
way to see this problem (and I believe it is a problem) is to look at the
whole paper and compare the peak of marginally significant results against the
whole set of p-values.

~~~
mjn
That's a common problem with drawing conclusions from automated literature
analysis. Often the decision to scan only abstracts isn't an intentional
experimental-design choice, but one driven by "data of convenience": the
researcher has easy access to a machine-processable set of plain-text
abstracts, but not to a similarly easy-to-work-with set of full papers.
Therefore, abstracts are analyzed!

Another potential confound for longer-term analyses is that the form of
abstracts is not constant over the years: abstracts in the 1970s and 2010s
aren't written in the same ways, and have different norms for what to include
and how to include it. Among other things, the form of abstracts has gotten
somewhat more structured/boilerplate, which is one reason I suspect they are
finding an increase in _all_ hits for their boilerplate search query.

------
mattfenwick
This is very interesting; however, I think the paper must be viewed with some
skepticism:

- Their Scopus queries had tons of problems, some of which they acknowledge.
Their model for positive and negative results also seems inadequate: a
paper can report multiple results, each of which will have its own p-value.
How would this show up in their queries? Furthermore, how accurate were their
queries (i.e. did they quantify the accuracy)?

- The results they got from Scopus depended heavily on what they queried for
(as mentioned in a previous comment); this was acknowledged in the paper.

- What about all the other p-values? They only looked at 0.04-0.049 and 0.051
and 0.06. What about 0.5? What about < 0.04? What about > 0.06? I can't
understand why they don't report results for these other ranges, especially
when they were already doing automated analysis. This makes me extremely
suspicious.

- Results before 1996 are suspect because the Scopus data is incomplete; this
is assumed not to matter because "no discontinuity appears in Figs. 3 & 5."
I.e., the authors have no idea what the results of the query would look like
across the full data set.

------
jtleek
If you find this interesting you might also like:

[http://biostatistics.oxfordjournals.org/content/early/2013/0...](http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt007.full)

The paper was discussed by several leading statisticians, including Andrew
Gelman, D.R. Cox, and Yoav Benjamini:

[http://simplystatistics.org/2013/09/25/is-most-science-false...](http://simplystatistics.org/2013/09/25/is-most-science-false-the-titans-weigh-in/)

------
DanielBMarkham
I don't want to throw a snarky remark out, but damn that had a bunch of
weasel-ese in it. Looks kinda bad, but in a way it looks kinda good...

In addition, I found this statement problematic: "The results indicate that
negative results are not disappearing, but have actually become 4.3 times more
prevalent since 1990. Positive results, on the other hand, have become 13.9
times more prevalent since 1990." "More prevalent", one presumes in this
context, is a measurement of a subset against the total. Otherwise all you're
really getting is a count of how much research is being done -- which I
believe is what is being reported? That doesn't sound like a useful metric to
me. Why would I care about the counts of positives or negatives when, in fact,
there are only two types of things being counted? All I would really care
about in this context would be their relative sizes.

I probably missed it.

~~~
claudius
There are numerous graphs plotting both the relative sizes of abstracts with
positive and negative results against the total (not every abstract contains
the wording they checked), e.g. Fig. 3, and graphs of their relative ratios
(e.g. Fig. 4).

However, they somewhat admit that their findings are mostly spurious since
checking for "p < 0.5" instead of "p = 0.04…" turned the results more or less
upside down.

------
tokenadult
A scholar of how scientific research is conducted and of statistical errors
that show up in many peer-reviewed scientific publications, Uri Simonsohn, has
devoted much thought with his colleagues to the issue of "p-hacking."
Simonsohn is a professor of psychology with a better than average
understanding of statistics. He and his colleagues are concerned about making
scientific papers more reliable. Many of the interesting issues brought up by
the comments on the article kindly submitted here become much more clear after
reading Simonsohn's various articles[1] about p values and what they mean, and
other aspects of interpreting published scientific research.

Simonsohn provides an abstract (which links to a full, free download of a
funny, thought-provoking paper)[2] with a "twenty-one word solution" to some
of the practices most likely to make psychology research papers unreliable. He
has a whole site devoted to avoiding "p-hacking,"[3] an all too common
practice in science that can be detected by statistical tests. You can use the
p-curve software on that site for your own investigations into p values found
in published research.
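
As a toy sketch of the p-curve idea (this is not the p-curve.com tool itself, and the effect size, sample size, and study counts below are made up for illustration): among statistically significant results, a real effect tends to produce p-values piling up near zero, whereas a batch of null effects produces a roughly flat distribution, so a hump just under 0.05 is suspicious.

```python
# Toy p-curve sketch: distribution of significant p-values under a real effect vs. under the null.
# All numbers here (effect size, group size, study count) are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulated_pvalues(effect, n_per_group=30, n_studies=5000):
    """Two-sample t-test p-values for many simulated studies with a given true mean difference."""
    a = rng.normal(0.0, 1.0, size=(n_studies, n_per_group))
    b = rng.normal(effect, 1.0, size=(n_studies, n_per_group))
    return stats.ttest_ind(a, b, axis=1).pvalue

for label, effect in [("true effect (d = 0.5)", 0.5), ("null effect (d = 0)", 0.0)]:
    p = simulated_pvalues(effect)
    significant = p[p < 0.05]
    counts, _ = np.histogram(significant, bins=[0.00, 0.01, 0.02, 0.03, 0.04, 0.05])
    print(label, "-> significant p-values per 0.01-wide bin:", counts)
```

With the effect present, the counts are heavily concentrated in the lowest bin; under the null they are roughly equal across bins.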

He also has a paper on evaluating replication results[4] (an issue we discuss
from time to time here on Hacker News) with more specific tips on that issue.

"Abstract: "When does a replication attempt fail? The most common standard is:
when it obtains p>.05. I begin here by evaluating this standard in the context
of three published replication attempts, involving investigations of the
embodiment of morality, the endowment effect, and weather effects on life
satisfaction, concluding the standard has unacceptable problems. I then
describe similarly unacceptable problems associated with standards that rely
on effect-size comparisons between original and replication results. Finally,
I propose a new standard: Replication attempts fail when their results
indicate that the effect, if it exists at all, is too small to have been
detected by the original study. This new standard (1) circumvents the problems
associated with existing standards, (2) arrives at intuitively compelling
interpretations of existing replication results, and (3) suggests a simple
sample size requirement for replication attempts: 2.5 times the original
sample."

[1] [http://opim.wharton.upenn.edu/~uws/](http://opim.wharton.upenn.edu/~uws/)

[2]
[http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2160588](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2160588)

[3] [http://www.p-curve.com/](http://www.p-curve.com/)

[4]
[http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2259879](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2259879)

AFTER EDIT: Hat tip to HN participant jasonhoyt for noticing that the URL on
the thread-opening submission was not the canonical URL, and doesn't point to
the latest preprint version of the article we are discussing in this thread.
The canonical URL (which is generally to be preferred for a posting to HN) is

[https://peerj.com/preprints/447/](https://peerj.com/preprints/447/)

------
kghose
You can see them straining, gritting their teeth saying "Must ... not ...
report ... p-value ... must ... report ... CONFIDENCE INTERVAL"

~~~
Paradigma11
But with a CI you answer a different question that may or may not be a better
fit to your needs.

You should not report a p-value because that is not the answer that null
hypothesis significance testing gives you.

You get "significant" or "not significant"; that's it.

------
timwaagh
Quote from the paper: "Fanelli found that the number of papers providing
support for the main hypothesis had increased from 70% in 1990 to 86% in 2007
(it is unclear why Fanelli reported an over 22% increase in the abstract)."

Of course, my calculator yields 86/70 ≈ 1.23 (which would be a >22% increase).

Such basic flaws really don't bode well for the rest of the paper.

~~~
efaref
I think it's generally frowned upon to do percentages of percentages.

For example: An increase from 70 bananas to 86 bananas is an increase of 16
bananas, or 22 per cent.

But an increase from 70 per cent to 86 per cent is an increase of 16 per cent.

~~~
rspeer
It's especially relevant that the 70% and 86% here are percentages of
different totals.

A "22% increase" sounds like the underlying number of positive results
increased by 22%, which it didn't. It increased by more than 22% because the
total number of publications is increasing too.

An "increase of 16 percentage points" would have been much more standard.

------
plg
"A surge of p-values between 0.040 and 0.049 in recent decades"

hmm ... what are the chances of that?

;)

Seriously though, why is the null hypothesis not considered? Wouldn't it
strengthen their case to reject the null hypothesis that this increase is
just happening randomly?

~~~
jerrytsai
To use hypothesis testing, you first have to state the null hypothesis and
decide on a statistical model by which you will test it. In this case, can you
say what the null hypothesis and the model are? It's unclear to me. There is
no obvious probability distribution of p-values, and any assumption as to
their distribution would be hard to defend.

And that's why, I suspect, no null hypothesis is stated or tested.

------
phkahler
What does the expected distribution of p-values look like? I suppose an answer
should consider random hypotheses separately from those that are tested
because there was some indication they might be true.

~~~
papaf
Some people model the distribution of p-values as a mixture of two
distributions - this is called the BUM (beta-uniform mixture) model.

The p-values from true null hypotheses (the source of any false positives) are
modelled as a uniform distribution. I think this is the equivalent of the
"random hypotheses" in your question.

The peak of p-values that occur for true positives is modelled as a beta
distribution. The beta distribution can take lots of shapes and has to have
its parameters fitted to match the collection of p-values.

My answer is from memory but if you are interested there is more information
in this clear paper:

[http://bioinformatics.oxfordjournals.org/content/19/10/1236....](http://bioinformatics.oxfordjournals.org/content/19/10/1236.short)

The paper concentrates on microarrays, which provide interesting data here:
you run a lot of hypothesis tests but typically choose one p-value as the
cutoff for all of them.
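
A minimal sketch of what fitting such a beta-uniform mixture might look like (this is not the code from the linked paper; the mixing proportion, beta shape, and sample sizes are made up for illustration):

```python
# Beta-uniform mixture (BUM) sketch: density f(p) = lam * 1 + (1 - lam) * a * p**(a - 1),
# i.e. a Uniform(0, 1) component for null tests and a Beta(a, 1) component (a < 1) for true effects.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulate p-values: 80% from true nulls (uniform), 20% from true effects (beta, skewed toward 0).
pvals = np.concatenate([rng.uniform(0, 1, size=800), rng.beta(0.2, 1.0, size=200)])

def negative_log_likelihood(params, p):
    lam, a = params  # lam: weight of the uniform (null) component; a: beta shape parameter
    density = lam + (1 - lam) * a * p ** (a - 1)
    return -np.sum(np.log(density))

# Fit lam and a by maximum likelihood, keeping both inside (0, 1).
fit = minimize(negative_log_likelihood, x0=[0.5, 0.5], args=(pvals,),
               bounds=[(1e-6, 1 - 1e-6), (1e-6, 1 - 1e-6)])
lam_hat, a_hat = fit.x
print(f"estimated null weight: {lam_hat:.2f}, beta shape: {a_hat:.2f}")
```

The fitted null weight then gives a rough sense of what fraction of the tests behave like chance results.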

------
jasonhoyt
Heads up! The original link pointed to v1 of the preprint. The current version
is v3 and can be found at the canonical URL
[https://peerj.com/preprints/447/](https://peerj.com/preprints/447/) or
[https://peerj.com/preprints/447v3/](https://peerj.com/preprints/447v3/)

v3 change log: We improved the legends of the tables, to make clear that the
table contains the percentage of papers reporting a significant result over
the sum of the papers reporting a significant or non-significant result
(S/(S+NS)).

v2 change log: This is version 2 of an earlier preprint with some corrections
in figure legends.

~~~
dang
Thanks! We changed the url to v3.

