
The Effect of Statistical Training on the Evaluation of Evidence (2016) [pdf] - luu
http://www.blakemcshane.com/Papers/mgmtsci_pvalue.pdf
======
ChicagoBoy11
The issue really surfaces when you dissect "statistical training". I had a
conversation with a colleague who was getting her PhD and about to publish a
paper. She mentioned to me that she had tried "a bunch" of different
specifications and the one that was significant was [I forget what the
variables were, but this is in the education field].

I pointed out to her that, since she tried a bunch of specifications and chose
the one which yielded the "positive" result, wouldn't that fundamentally alter
the interpretation of the p-value, and possibly invalidate her claim?

She looked at me like I was a moron, uneducated, and trying to be difficult.
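
For anyone who wants the intuition made concrete, here is a minimal simulation
sketch (illustrative numbers only, assuming ten independent specifications and
a true null -- not her actual analysis):

    # If you try many specifications and keep the "significant" one,
    # the false-positive rate is far above the nominal 5%.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_sims, n_specs, n = 2000, 10, 50
    hits = 0
    for _ in range(n_sims):
        X = rng.normal(size=(n, n_specs))  # ten unrelated predictors
        y = rng.normal(size=n)             # outcome with NO real effect
        pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(n_specs)]
        hits += min(pvals) < 0.05          # keep the best-looking result
    print(hits / n_sims)  # ~0.40 (about 1 - 0.95**10), not 0.05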

I'm forever indebted to my econometrics professor, who helped me build a
tremendous mental model for thinking about these issues. But between the
really poor training (most people I encounter in the field just run STATA/SPSS
commands with limited or no understanding of what they do mathematically) and
the terrible incentives facing researchers (you don't see many negative
results published, do you?), it's no surprise that when people come in and
attempt to seriously replicate even foundational findings, they often come
away disappointed (thinking of the Reproducibility Project -
[https://en.wikipedia.org/wiki/Reproducibility_Project](https://en.wikipedia.org/wiki/Reproducibility_Project)).

I, too, don't have an answer to this. But I can only imagine that the way to
prevent it is to push the field in a direction where truth-seeking, not
novelty, is rewarded. I've mentioned this before here, but I know professors
who established their entire careers on the backs of research that was later
completely refuted -- even in cases where their actions were fraudulent. And
they are still teaching -- and successful.

As academics -- like the media and everyone else -- are chasing our
ever-dimming attention spans, I think this will be a really tough nut to crack
meaningfully.

~~~
ams6110
When you've invested 4 or more years in Ph.D. research, there are _powerful_
motivations to "find" results that support your thesis. Perhaps they are
subconscious, but they are powerful.

~~~
nonbel
The people who end up like that are the exact people who the PhD process is
supposed to filter out...

~~~
opportune
No university or professor wants to be known for having any sizable number of
their PhD candidates filtered out in the middle of the process. They'll delay
poor performers for sure, but it just looks really bad when their candidates
essentially fail out.

Also consider that often it's not even the candidate's fault that their
research went nowhere. Sometimes it's just due to chance that a particular
path became a dead end, but other times it's due to a professor obstinately
pushing them to focus on something unproductive, perhaps well after it's
become evident that a dead end is coming.

It's a messed up system. I think the entire field of academic research needs a
complete overhaul.

~~~
nonbel
Yes, the current system rewards that behavior and, as a side effect, rewards
misunderstanding stats, etc.

The people who have no qualms about "statistically significant" = "my theory
is correct" (either due to maliciousness or ignorance) will graduate much
sooner and be able to pump out more papers.

------
WhitneyLand
Why can't journals set a bar and flag some of these practices in peer review?

It's disillusioning to see that academia seems to face as many of these issues
as any other profession or industry.

So much productivity is lost due to behavior driven by long-standing
incentives to benefit institutions, corporations, and established power,
rather than incentives that hold people accountable for quality and the
efficient acquisition of knowledge.

Is it so much different than problems that come up with politicians, police
officers, medical doctors, or corporate bad behavior?

I think my naivete was that a community dedicated to truth and discovery would
somehow be less susceptible to these problems, when in fact, it's just human
nature reacting to an environment that can be as unhealthy as any other.

At least I can allow myself an occasional fantasy of earning the $30B I'd
need to take over Elsevier and turn it into a non-profit interested only in
effecting positive change.

~~~
tnecniv
> Why can't journals set a bar and flag some of these practices in peer
> review?

Kind of a chicken-and-egg problem. Peer review is done by other researchers
working in the field, often on a volunteer basis. If the average researcher in
the community is committing common statistical errors, they won't know to flag
them in a paper.

------
neuro_logical
At least speaking as an academic (4th-year PhD student), one of the
challenges with researchers' over-reliance on NHST seems to be the apparent
lack of a more compelling alternative. One candidate that is gaining traction
is Bayes factors, but there are challenges with this approach as well, e.g.,
the suitable specification of priors. The best way forward will involve
fundamentally restructuring how we educate incoming researchers, because it
will necessarily mean embracing uncertainty over the dichotomization of
results into yes/no. Andrew Gelman and John Carlin write elegantly about ways
forward here:
[http://www.stat.columbia.edu/~gelman/research/published/jasa...](http://www.stat.columbia.edu/~gelman/research/published/jasa_signif_2.pdf)
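
A minimal sketch of one common construction, a Bayes factor for a binomial
experiment with a point null against a flat Beta(1, 1) prior (the data and
numbers are made up, and this is not from the Gelman and Carlin paper):

    import numpy as np
    from scipy.stats import binom
    from scipy.special import gammaln, betaln

    k, n = 61, 100                    # hypothetical: 61 successes in 100
    m0 = binom.pmf(k, n, 0.5)         # marginal likelihood under H0
    # Under H1, integrating the binomial likelihood over a Beta(1, 1)
    # prior on theta gives the closed form C(n, k) * B(k+1, n-k+1).
    log_m1 = (gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
              + betaln(k + 1, n - k + 1))
    print("BF01 =", m0 / np.exp(log_m1))  # >1 favors H0, <1 favors H1

The prior-sensitivity problem mentioned above shows up directly here: swap the
Beta(1, 1) for a more concentrated prior and BF01 changes.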

~~~
nonbel
The alternative is to stop testing a default "nil" hypothesis and instead come
up with various mathematical/computational models for the process that
generated the data, then test those models on future data.

The model that makes the most precise and accurate predictions with the least
assumptions/complexity should be used until a better one comes along.

This is quite honestly just how science used to be done before NHST. It is
still done this way in many areas of physics and engineering.
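
A minimal sketch of that workflow, with made-up data: fit competing models on
past observations, score them on future ones, and prefer the simpler model
unless the complex one predicts clearly better.

    import numpy as np

    rng = np.random.default_rng(1)
    x_old = np.linspace(0, 1, 50)
    y_old = 2.0 * x_old + rng.normal(scale=0.3, size=50)  # "past" data
    x_new = np.linspace(0, 1, 50)
    y_new = 2.0 * x_new + rng.normal(scale=0.3, size=50)  # "future" data

    for degree in (1, 5):  # a simple model and a more complex one
        coefs = np.polyfit(x_old, y_old, degree)
        mse = np.mean((np.polyval(coefs, x_new) - y_new) ** 2)
        print(f"degree {degree}: out-of-sample MSE = {mse:.3f}")
    # Comparable errors -> keep the simpler (degree-1) model.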

------
AstralStorm
Whoever chose the letter p (suggesting probability) over s (significance) or
e1 (type 1 error value) did a major disservice to the world of science.

The p value is a measure of error (and not a probability, but maybe one kind
of likelihood), not evidence: how likely it is we are seeing the result of
chance. (How the chance is defined is important and hidden in the method used
to compute the p value.) And any hard cutoff is really missing the point;
there is a reason this is a number, not a binary value.

Effect size estimates or risk ratios are much more useful numbers anyway.

~~~
nonbel
>"P value is a measure of error (And not a probability but maybe one kind of
likelihood), not evidence, how likely it is we are seeing the result of a
chance"

Nope, please stop trying to "help" people understand p-values without
understanding them yourself. Reminds me of this just a few days ago:
[https://news.ycombinator.com/item?id=14646844](https://news.ycombinator.com/item?id=14646844)

Honestly, most of the problem is people who don't know what they are talking
about teaching statistics/science to each other... They have created an insane
vortex of misinformation and ignorance that keeps sucking in new areas of
research.

PS: If you or someone you know is doing research and doesn't recognize the
almost unbelievably huge problems with the way things are done, you are most
likely part of the problem.

~~~
AstralStorm
P is a kind of likelihood with a very strong assumption: the likelihood of a
type 1 error (accidentally getting a result) if the error distribution matches
Student's t squared - if you are using the Fisher test (ANOVA, that is) to
compute it. And if the hypotheses are independent, which is supposed to be the
case for the null.

Specifically, a logarithm of a likelihood ratio of two hypotheses given the
assumption.

Note the three specific ifs. Then read up on what the difference between
likelihood and probability is before correcting people. Or on what the
difference between a likelihood and a likelihood ratio is.

(Not much, in the case of null hypothesis testing, unless you didn't check the
null to be true in the untreated population. Or how true it is, as in you
didn't get the likelihood for placebo. You get the actual likelihood
approximation from Wilks' theorem after relating the F distribution to chi
squared.)

Now this fails if:

The actual distribution of the observations is not close to t or normal.
(There are some more technical reasons; this is a shorthand.) For example,
multimodal or highly skewed. This is typically glossed over. (It will usually
cause a false positive.) This has to be checked for the null too.

The hypotheses are related. (Say, multiple observations over different
dosages.) There are different tests that should be used instead.

A few more technical reasons.

~~~
nonbel
This is wrong in multiple ways. The p-value is not an error rate _even if_ the
null model is correct; "alpha" (the cutoff, usually 0.05) is the error rate.

The p-value also has nothing to do with "how likely it is we are seeing the
result of chance." As you note regarding the t-distribution, the p-value
calculation _assumes the null model is true_ (which usually, though not
necessarily, amounts to "chance did it").

How can a procedure simultaneously assume something is true and give you the
probability that it is true? You are also using the term "likelihood" in a
strange way (it has a specific meaning when it comes to statistics). Anyway,
your comments seem to indicate you hold several of the standard
misinterpretations of p-values:

1 If P = .05, the null hypothesis has only a 5% chance of being true.

2 A nonsignificant difference (eg, P >= .05) means there is no difference
between groups.

3 A statistically significant finding is clinically important.

4 Studies with P values on opposite sides of .05 are conflicting.

5 Studies with the same P value provide the same evidence against the null
hypothesis.

6 P = .05 means that we have observed data that would occur only 5% of the
time under the null hypothesis.

7 P = .05 and P <=.05 mean the same thing.

8 P values are properly written as inequalities (eg, “P <=.02” when P =.015)

9 P = .05 means that if you reject the null hypothesis, the probability of a
type I error is only 5%.

10 With a P = .05 threshold for significance, the chance of a type I error
will be 5%.

11 You should use a one-sided P value when you don’t care about a result in
one direction, or a difference in that direction is impossible.

12 A scientific conclusion or treatment policy should be based on whether or
not the P value is significant.

[https://www.ncbi.nlm.nih.gov/pubmed/18582619](https://www.ncbi.nlm.nih.gov/pubmed/18582619)
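
For contrast with the list above, a minimal simulation sketch of what the
p-value _does_ deliver (illustrative, a two-sample t-test under a true null):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    pvals = np.array([
        stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
        for _ in range(10_000)  # both groups from the SAME distribution,
    ])                          # so the null is true by construction
    print(np.mean(pvals <= 0.05))  # ~0.05: that error rate belongs to alpha
    # Note what this is not: the probability that the null is true. The
    # null was assumed true to generate every draw.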

~~~
AstralStorm
Alpha is the error rate of the binary significance test, if anything. It is
not related to any of the observations.

However, since p is a log-likelihood ratio, it is related to the observations
and the null themselves. To get an actual likelihood you need to exponentiate,
know the degrees of freedom, and know one of the actual likelihoods.
(Typically the null is easier to come by, as the other side is assumed to be
null+treatment.) This exact likelihood is of course tiny. Which is why the
inequality is supposed to be used.

The process is not valid if any of the test's assumptions is violated in a big
way.

Other than this, all the points are true.

~~~
nonbel
The standard meaning of likelihood is P(data|Hypothesis)[1] where, for our
purposes here, the hypothesis will refer to "chance/accident generated the
data". Do you agree with this? (I understand you say "kind of likelihood"
because we are dealing with the inequality.)

If so, can you clarify what you mean by "P is ... Likelihood of type 1 error
(accidentally getting a result)..." ?

[1]
[https://en.wikipedia.org/wiki/Likelihood_function](https://en.wikipedia.org/wiki/Likelihood_function)

~~~
kgwgk
The standard meaning of likelihood is P(parameters|data).

~~~
nonbel
Source?

~~~
kgwgk
Actually I should have written L(parameters|data), I'm sorry for the
confusion. It is equal, for a given value of the data and parameters, to
P(data|parameters). But the likelihood function is not a function of the data
(with the parameters fixed); it is a function of the parameters (with the data
fixed). Its "meaning" is not a probability distribution over different
outcomes given the hypothesis. But maybe I misinterpreted your comment.
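
A minimal illustration of that distinction (made-up coin-flip data):

    from scipy.stats import binom

    k, n = 7, 10                        # the data are held fixed
    for theta in (0.3, 0.5, 0.7, 0.9):  # the parameter varies
        print(theta, binom.pmf(k, n, theta))  # L(theta | data)
    # These values need not sum to 1 over theta: the likelihood function
    # is not a probability distribution over hypotheses.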

------
pasbesoin
I guess this could be considered kind of OT. On the other hand, if people were
introduced to it sooner, before and outside the context of a personal
investment, i.e. "proving" their own research, we might also be better off
with regard to our science.

----

Delaying statistics until college -- for most public school programs -- is a
mistake.

It leaves a large portion of the U.S. population with a fundamental
mathematical illiteracy.

And given the degree to which statistics rule our lives and collective
decisions these days, a societal illiteracy.

Just look around you.

~~~
tnecniv
Indeed, probability and statistics make a lot of sense to include in a high
school curriculum. In addition to the practicality, they are very colorful
topics that can serve to make students more interested in math.

------
ignostic
Okay, this is really interesting. I'm pretty skeptical of their methodology
for assessing "statistical training." The setup could result in studying a
different correlation entirely. For example, we might instead be studying the
difference between a researcher who's unwilling to send a survey back without
Googling or double-checking the work and a researcher who is busy or willing
to be wrong... among researchers who are willing to respond in the first
place. I don't expect that to be a productive line of discussion, though, so
I'll try to ignore it.

I can't help but think of Thomas Kuhn, who argued that the institutions of
science (from researchers to reviewers to the press) tend to ignore study
results that conflict with the current paradigm. So as long as our paradigm is
correct, science progresses extremely quickly. When our paradigm and
assumptions are wrong we spend a lot of time floundering because we ignore
conflicting data that should instead lead to refinements or more questions. We
want to slap a label of right/wrong on something and move on to the next
question. That mentality can really hinder our ability to assess anomalies
later on.

This is similar to the problem discussed in the paper. When researchers decide
a paradigm or test result is true or false with no further thought to the
confidence level or details, you end up with a system where uncertainty and
anomaly are both ignored. It then takes a tremendous amount of momentum to
overcome the assumptions, which are sometimes several steps back into
"accepted science" at that point. We might even throw out some good ideas from
the previous paradigm as we transition. I'm not a physicist, but it seems like
they're seeing this happen right now. We've certainly seen this several times
with dietary health.

A decent analogy might be a hike with unclear trails and markers. At the first
crossroads we might decide that left is the correct path to our destination.
Once down that path, there are dozens of other paths. If we find trails that
continually lead nowhere, the common human response is to keep trying well
past the point where we should have gone back to our original assumption. When
we finally decide to go back and try the right-side path, we completely give
up on the left side, despite the fact that we haven't checked every possible
sub-trail on the left side.

Wandering may be impossible to avoid, even with good judgement, but we can
avoid a lot of wasted time by looking back at each of our
crossroads/assumptions, assessing the probability that each is correct, then
moving forward in a way that's most likely to answer a new question while
testing the previous assumptions. By and large, this is _not_ how science is
working. We get a few studies, accept something is true, and then wander off
randomly down the trails that look most interesting.

