The idea that you get a lot of just-marginally-significant results due to "p-hacking", i.e. minor fiddling such as excluding outliers or picking one hypothesis test over another, is probably true. Then there is positive publication bias, i.e. only publishing positive findings.
However, looking at the abstract doesn't really address either of these, as even absent p-hacking or positive bias you would still expect the abstract to contain the selected highlights (i.e. positive findings) from the paper. It is the one part of the paper where you really should expect positive selection bias!
If the paper has lots of negative tests (and most biological papers will report lots of negative-control p-values), these aren't picked up. A better way to see this problem (and I believe it is a problem) is to look at the whole paper and compare the peak of marginal results against the full set of p-values.
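A rough sketch of what that full-text check could look like (the regex, the sample text and the 0.040–0.049 window are just illustrative assumptions, not anything from the paper):

```python
import re

# Pull every reported p-value out of a paper's full text and see how many
# fall in the "just barely significant" window.
P_VALUE_RE = re.compile(r"p\s*[=<>]\s*(0?\.\d+)", re.IGNORECASE)

def extract_p_values(full_text: str) -> list[float]:
    """Return every numeric p-value mentioned in the text."""
    return [float(match) for match in P_VALUE_RE.findall(full_text)]

def marginal_fraction(p_values: list[float]) -> float:
    """Fraction of reported p-values falling just below 0.05."""
    if not p_values:
        return 0.0
    marginal = [p for p in p_values if 0.040 <= p <= 0.049]
    return len(marginal) / len(p_values)

sample = "Controls: p = 0.81, p = 0.43. Main effect: p = 0.047; replication p = 0.044."
ps = extract_p_values(sample)
print(ps)                     # [0.81, 0.43, 0.047, 0.044]
print(marginal_fraction(ps))  # 0.5
```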
That's a common problem with drawing conclusions from automated literature analysis. Often the decision to scan only abstracts isn't an intentional experimental-design choice but a matter of "data of convenience": the researcher has easy access to a machine-processable set of plain-text abstracts, but no comparably easy-to-work-with set of full papers. Therefore, abstracts are analyzed!
Another potential confound for longer-term analyses is that the form of abstracts is not constant over the years: abstracts in the 1970s and 2010s aren't written in the same ways, and have different norms for what to include and how to include it. Among other things, the form of abstracts has gotten somewhat more structured/boilerplate, which is one reason I suspect they are finding an increase in all hits for their boilerplate search query.
>absent either P hacking or positive bias you would still expect the abstract to contain the selected highlights (i.e. positive findings) from the paper.
Sure, provided that the reported p-values for positive findings have been corrected appropriately for any multiple comparisons. Abstracts should summarise but not mislead...
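To make that concrete, here is a toy example with a plain Bonferroni correction (one standard option among several, picked purely for illustration; the numbers are made up):

```python
# Bonferroni: with m tests, compare each p-value to alpha / m instead of alpha.
def bonferroni_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# A headline-worthy p = 0.04 is no longer significant once you admit that
# ten comparisons were run (the threshold becomes 0.005).
p_values = [0.04, 0.20, 0.33, 0.47, 0.51, 0.62, 0.70, 0.81, 0.90, 0.95]
print(bonferroni_reject(p_values))  # all False
```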
This is very interesting, however, I think the paper must be viewed with some skepticism:
- their Scopus queries had tons of problems, some of which they acknowledge. Their model for positive and negative results also seems inadequate: a paper can report multiple results, each with its own p-value. How would that show up in their queries? Furthermore, how accurate were the queries (i.e. did they quantify their accuracy)?
- the results they got from Scopus depended heavily on what they queried for (as mentioned in a previous comment). (This was acknowledged in the paper)
- what about all the other p-values? They only looked at 0.040–0.049 and 0.051–0.060. What about exactly 0.05? What about < 0.04? What about > 0.06? I can't understand why they don't report results for these other ranges, especially when they were already doing automated analysis. This makes me extremely suspicious.
- results before 1996 are suspect because the Scopus data is incomplete; this is assumed to not matter because "no discontinuity appears in Figs. 3 & 5." I.e., the authors have no idea what the results of the query would look like across the full data set.
I don't want to throw a snarky remark out, but damn that had a bunch of weasel-ese in it. Looks kinda bad, but in a way it looks kinda good...
In addition, I found problematic the statement "The results indicate that negative results are not disappearing, but have actually become 4.3 times more prevalent since 1990. Positive results, on the other hand, have become 13.9 times more prevalent since 1990." "More prevalent", one presumes in this context, is a measurement of a subset against the total. Otherwise all you're really getting is a count of how much research is being done -- which I believe is what is being reported? That doesn't sound like a useful metric to me. Why would I care about the counts of positives or negatives when, in fact, there are only two types of things being counted? All I would really care about in this context would be their relative sizes.
There are numerous graphs plotting both the relative sizes of abstracts with positive and negative results against the total (not every abstract contains the wording they checked), e.g. Fig. 3, and graphs of their relative ratios (e.g. Fig. 4).
However, they somewhat admit that their findings are mostly spurious, since checking for "p < 0.05" instead of "p = 0.04…" turned the results more or less upside down.
If you look at Figure 3 (and section 3.2.1), you'll find they're reporting the number of papers in percentage of the total number of papers. So, relative size. The increase given in the abstract is that between 1990 and 2014.
Uri Simonsohn, a scholar of how scientific research is conducted and of the statistical errors that show up in many peer-reviewed scientific publications, has devoted much thought, together with his colleagues, to the issue of "p-hacking." Simonsohn is a professor of psychology with a better-than-average understanding of statistics. He and his colleagues are concerned with making scientific papers more reliable. Many of the interesting issues brought up by the comments on the article kindly submitted here become much clearer after reading Simonsohn's various articles[1] about p-values, what they mean, and other aspects of interpreting published scientific research.
Simonsohn provides an abstract (which links to a full, free download of a funny, thought-provoking paper)[2] with a "twenty-one word solution" to some of the practices most likely to make psychology research papers unreliable. He has a whole site devoted to avoiding "p-hacking,"[3] an all too common practice in science that can be detected by statistical tests. You can use the p-curve software on that site for your own investigations into p values found in published research.
He also has a paper on evaluating replication results[4] (an issue we discuss from time to time here on Hacker News) with more specific tips on that issue.
"Abstract: "When does a replication attempt fail? The most common standard is: when it obtains p>.05. I begin here by evaluating this standard in the context of three published replication attempts, involving investigations of the embodiment of morality, the endowment effect, and weather effects on life satisfaction, concluding the standard has unacceptable problems. I then describe similarly unacceptable problems associated with standards that rely on effect-size comparisons between original and replication results. Finally, I propose a new standard: Replication attempts fail when their results indicate that the effect, if it exists at all, is too small to have been detected by the original study. This new standard (1) circumvents the problems associated with existing standards, (2) arrives at intuitively compelling interpretations of existing replication results, and (3) suggests a simple sample size requirement for replication attempts: 2.5 times the original sample."
AFTER EDIT: Hat tip to HN participant jasonhoyt for noticing that the URL on the thread-opening submission was not the canonical URL, and doesn't point to the latest preprint version of the article we are discussing in this thread. The canonical URL (which is generally to be preferred for a posting to HN) is
Quote from the paper: "Fanelli found that the number of papers providing support for the main hypothesis had increased from 70% in 1990 to 86% in 2007 (it is unclear why Fanelli reported an over 22% increase in the abstract)."
Of course, my calculator yields 86/70 ≈ 1.229 (which would be a >22% increase).
Such basic flaws really don't bode well for the rest of the paper.
It's especially relevant that the 70% and 86% here are percentages of different totals.
A "22% increase" sounds like the underlying number of positive results increased by 22%, which it didn't. It increased by more than 22% because the total number of publications is increasing too.
An "increase of 16 percentage points" would have been much more standard.
"A surge of p-values between 0.040 and 0.049 in recent decades"
hmm ... what are the chances of that?
;)
Seriously though, why is the null hypothesis not considered? Wouldn't it strengthen their case, to reject the null hypothesis that this increase is just happening randomly?
To use hypothesis testing, you first have to state the null hypothesis and decide on a statistical model under which you will test it. In this case, can you say what the null hypothesis and the model are? It's unclear to me. There is no obvious probability distribution of p-values, and any assumption about their distribution would be hard to defend.
And that's why, I suspect, no null hypothesis is stated or tested.
What does the expected distribution of p-values look like? I suppose an answer should consider random hypotheses separately from those that are tested because there was some indication they might be true.
Some people model the distribution of p-values as a mixture of two distributions; this is called the BUM (beta-uniform mixture) model.
The false positives are modelled as a uniform distribution. I think that this is the equivalent of the "random hypothesis" in your question.
The peak of p-values that occur for true positives is modelled as a beta distribution. The beta distribution can take lots of shapes and has to have its parameters fitted to match the collection of p-values.
My answer is from memory but if you are interested there is more information in this clear paper:
The paper concentrates on microarrays, which provide interesting data for this question: you do a lot of hypothesis tests but typically choose a single p-value cutoff for all of them.
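For the curious, here is a minimal sketch of fitting that kind of beta-uniform mixture by maximum likelihood (numpy/scipy assumed; the simulated data and parameter values are purely illustrative):

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulate p-values from a beta-uniform mixture: a fraction `lam` are nulls
# (uniform on [0, 1]); the rest come from Beta(a, 1) with a < 1, which piles
# up near zero the way true positives do. Values here are made up for the demo.
lam_true, a_true, n = 0.7, 0.3, 5000
is_null = rng.random(n) < lam_true
p_values = np.where(is_null, rng.random(n), rng.beta(a_true, 1.0, n))

def neg_log_likelihood(params):
    lam, a = params
    # BUM density: lam * Uniform(0, 1) + (1 - lam) * Beta(a, 1)
    dens = lam + (1.0 - lam) * stats.beta.pdf(p_values, a, 1.0)
    return -np.sum(np.log(dens))

# Fit lam (null fraction) and a (beta shape, constrained to (0, 1)) by MLE.
fit = minimize(neg_log_likelihood, x0=[0.5, 0.5],
               bounds=[(1e-3, 1 - 1e-3), (1e-3, 1 - 1e-3)],
               method="L-BFGS-B")
lam_hat, a_hat = fit.x
print(f"estimated null fraction {lam_hat:.2f}, beta shape {a_hat:.2f}")
```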
v3 change log:
We improved the legends of the tables, to make clear that the table contains the percentage of papers reporting a significant result over the sum of the papers reporting a significant or non-significant result (S/(S+NS)).
v2 change log:
This is version 2 of an earlier preprint with some corrections in figure legends.