

In defence of the P-value - thanatosmin
https://scientistseessquirrel.wordpress.com/2015/02/09/in-defence-of-the-p-value/

======
stdbrouw
This defense of P-values is akin to how a certain type of computer programmer
tends to defend bad design decisions, like "if you accidentally typed `rm -rf
/` and it wiped out your entire system, then you're an idiot, learn to use
your tools".

Thing is: even statistics professors often don't manage to interpret P-values
and confidence intervals the way they should. (See e.g. "Students’
misconceptions of statistical inference: A review of the empirical evidence
from research on statistics education") You're almost automatically forced
into a sort of double talk where you write about P(D|H0) but deep down you'd
really like your readers to think it's about 1-P(HA|D).
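
To make the double talk concrete, here is a minimal sketch of the gap between
the two quantities, assuming (purely for illustration) a 10% prior probability
that the alternative is true, 80% power and alpha = 0.05:

    # Sketch: p < 0.05 does not mean P(H0 | data) < 0.05.
    # All numbers below are illustrative assumptions, not measured values.
    prior_ha = 0.10   # prior probability that the alternative hypothesis is true
    power    = 0.80   # P(significant result | HA true)
    alpha    = 0.05   # P(significant result | H0 true)

    # Bayes' rule over the event "the test came out significant":
    p_significant  = power * prior_ha + alpha * (1 - prior_ha)
    p_ha_given_sig = power * prior_ha / p_significant

    print(f"P(HA | significant) = {p_ha_given_sig:.2f}")      # ~0.64, not 0.95
    print(f"P(H0 | significant) = {1 - p_ha_given_sig:.2f}")  # ~0.36, not 0.05

Even a perfectly interpreted p-value says nothing about P(HA|D) until you bring
in a prior.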

When a tool can be used correctly (if handled very delicately and by a true
expert) but practically encourages you to abuse it, at a certain point "it's
not the tool, it's you" stops being a convincing excuse, and you've just got
to say: fuck it, we need better tools.

~~~
lmm
At some point research is going to be hard - after all, if it were easy to
discover someone would have done so already. Sure, we should always strive to
improve our tools - but that's also hard. Throwing up our hands and stopping
research is not an option. So what do we do?

Is there a better alternative? If the P-value is the frying pan then
Bayesianism is the fire; you don't get a standard library of tools appropriate
to different situations, you get one powerful tool that has to be applied
correctly, and for the part that really matters in the contentious cases -
choice of prior - you're almost completely on your own.

Release your raw data. Use simple, standard analyses; do them once and do them
right, and stay the hell away from subgroup analysis. It really isn't too hard
to get the statistics right - it's just very easy to let yourself slip,
especially when your future funding/career depends on it.

~~~
stdbrouw
Power analysis and confidence intervals have their own problems, but they're
already a vast improvement over p-values. No need to go all-out Bayesian
straight away – though certainly, if we can get the software to a point where
that would be possible for non-statisticians, that'd be awesome. (Bayes
factors help when you don't want to fix a prior.)
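
For a flavour of what that looks like, here's a minimal sketch of an a priori
power analysis using statsmodels, assuming (for illustration only) a medium
effect size of d = 0.5:

    # Sketch: sample size needed per group to detect an assumed effect of
    # d = 0.5 with 80% power at alpha = 0.05, two independent groups.
    from statsmodels.stats.power import TTestIndPower

    n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                              power=0.8, alternative='two-sided')
    print(f"~{n_per_group:.0f} subjects per group")  # roughly 64

The point is to decide the sample size before collecting data and to report
the effect size with its confidence interval afterwards, rather than leaning on
p < 0.05 alone.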

For anyone who's interested in reading more about this middle-ground approach,
"The Essential Guide to Effect Sizes" by Paul Ellis is an awesome introduction
not just to power analysis but to better statistical practice and statistical
reporting that sticks to the frequentist tradition.

Also, I don't feel subgroup analysis is necessarily the problem, at least not
when you apply any of the available corrections for familywise error rates
(Bonferroni and Tukey's HSD being two of the more well known). This often
comes up in medical research, e.g.
[http://www.badscience.net/2009/04/a-frankly-thin-contrivance-for-writing-on-the-fascinating-issue-of-subgroup-analysis/](http://www.badscience.net/2009/04/a-frankly-thin-contrivance-for-writing-on-the-fascinating-issue-of-subgroup-analysis/)
-- but in those cases subgroup analyses are performed with bad intentions, a
whole other problem!
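
As a rough sketch of the kind of correction meant here, with made-up p-values
from five hypothetical subgroup comparisons, the plain Bonferroni adjustment is
just:

    # Sketch: Bonferroni correction for a family of m subgroup tests.
    # These p-values are made up purely for illustration.
    raw_p = [0.012, 0.034, 0.049, 0.210, 0.003]
    m = len(raw_p)

    for p in raw_p:
        p_adj = min(p * m, 1.0)   # Bonferroni-adjusted p-value
        print(f"raw p = {p:.3f}  adjusted p = {p_adj:.3f}  "
              f"reject at familywise alpha = 0.05: {p_adj < 0.05}")

The adjustment controls the familywise error rate, so only the strongest of the
five made-up results would survive.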

A bigger issue, perhaps, is drawing inferences from hypotheses that came up
while analyzing the data in the first place – it's very appealing, but it's
also wonderfully circular in its logic, which is why in machine learning
circles it's become almost self-evident to split up the data into training and
test sets.
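
In its simplest form that habit looks like this sketch (scikit-learn's
train_test_split on made-up data): generate hypotheses on the training half,
then confirm them only on data that played no part in suggesting them.

    # Sketch: hold out a test set so that hypotheses suggested by the data
    # are confirmed on data that did not suggest them. Data are made up.
    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))   # made-up predictors
    y = rng.normal(size=1000)        # made-up outcome

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                        random_state=0)
    # explore / fit on (X_train, y_train); report results only on (X_test, y_test)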

~~~
IndianAstronaut
>machine learning circles it's become almost self-evident to split up the data
into training and test sets.

This is great for predictive modeling but how does this work for analysis of
experiments?

~~~
stdbrouw
Not experiments, but observational data is often analyzed using regression,
the poor man's predictive modeling :-)

------
navait
This is about more than mere p-values. This is a fundamental problem of tools
themselves. Using a programming language, hammer, or stats tool incorrectly
courts disaster.

But some tools are better than others at teaching people how to use them
correctly. Math tools like the p-value are really hard, because
most scientists are not math people - they want a quick solution, whereas
statistics requires deep understanding. P-values are just so appealing as a
magic number. An apt analogy is the late-90s PHP people who just wanted a
solution now, rather than designing a robust application for the future. The
question then becomes:

1) How can we build tools that lead people toward the correct decision, or at
least help them avoid the unambiguously wrong ones?

2) If we can't do 1, we need to spend more time educating users, and figuring
out why they make the wrong decisions regarding a tool.

Unfortunately, I can't say these are easy, or even possible, problems to solve.
We just do the best we can.

------
fiatmoney
The most important thing your P-value "doesn't do" is tell you if your
underlying model structure is correct.

In particular, the actual value of your P-value is dependent on the underlying
distribution, which rather begs the question.

This is the biggest issue with the P-value in the context of social science or
commerce.

For physical systems, where God[1] actually makes sure that your
distributions, errors, etc are normal, it's pretty awesome.

[1]
[http://en.wikipedia.org/wiki/Central_limit_theorem](http://en.wikipedia.org/wiki/Central_limit_theorem)
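
A quick simulation sketch of that dependence (assumed setup: one-sample t-tests
on samples drawn from a skewed exponential distribution, with the null actually
true): the realised false-positive rate of a nominal 5% test drifts away from
5% for small samples and only recovers as the central limit theorem kicks in.

    # Sketch: the realised false-positive rate of a nominal 5% t-test depends
    # on the underlying distribution. Samples come from Exp(1), so the true
    # mean is 1 and the null hypothesis (mean == 1) is actually true.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    for n in (5, 20, 100):
        rejections = sum(
            stats.ttest_1samp(rng.exponential(scale=1.0, size=n),
                              popmean=1.0).pvalue < 0.05
            for _ in range(20000)
        )
        print(f"n = {n:3d}: false-positive rate ~ {rejections / 20000:.3f}")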

~~~
IndianAstronaut
>The most important thing your P-value "doesn't do" is tell you if your
underlying model structure is correct.

It wasn't meant for that, though. It is meant as a value that gives an
indication of the probability that the experimental procedure you are
performing is causing a difference between the groups. It is just one
indication among many diagnostics to run on a model.

>For physical systems, where God[1] actually makes sure that your
distributions, errors, etc are normal, it's pretty awesome.

Non-parametric tests have been around for quite some time.
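
For instance, a minimal sketch of one such test (the Mann-Whitney U test in
SciPy, on made-up data), which compares ranks and assumes no particular
distribution:

    # Sketch: Mann-Whitney U test, a rank-based two-sample test that does not
    # assume normality. The data below are made up for illustration.
    from scipy.stats import mannwhitneyu

    group_a = [1.2, 3.4, 2.2, 5.1, 0.8, 2.9]
    group_b = [4.5, 6.1, 5.8, 7.0, 3.9, 6.6]

    stat, p = mannwhitneyu(group_a, group_b, alternative='two-sided')
    print(f"U = {stat}, p = {p:.4f}")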

------
lifeisstillgood
I'm going to go for the embarrassing question, but I don't understand p-values
or how to design a study. Any pointers gratefully recvd

If I understand correctly, let's say I am concerned Earth's gravity is
increasing underneath Kent school buildings, thus stunting kids' growth.

I decide to test this by making a statistical survey with a p-value in it.

We shall sample the heights of 10,000 randomly chosen children from 10
counties, one of which is Kent (so 1,000 kids from each county).

My null hypothesis is that there is no gravitational anomaly, and so the height
distributions should be equal, or the mean heights should be equal, or ...

So I get a bit lost here.

What is 95% significance ...

~~~
thaumasiotes
Going by the wikipedia article, a p-value is the probability of getting a
result "at least as extreme" as what you observe under some hypothesis (in
practice, the null hypothesis, however defined). Imagine we're spinning a
spinner to get one of two values ("red" or "blue"), and we think a good null
hypothesis is "red and blue are equally likely outcomes of this spinner".
Spinning 500 times, we "should" see 250 of each, but obviously the odds of
that are quite low. So if I spin the spinner 500 times and see 400 reds and
100 blues, what does that mean?

Under the null hypothesis, the odds of that result are (0.5)^400 * (0.5)^100 *
(500 choose 400). The first step in calculating the p-value is to decide which
outcomes count as "at least as extreme"; the obvious candidates are (1) any
result with 400 or more reds; and (2) any result with 400 or more of either
color. (1) is better for talking about the hypothesis "this spinner is biased
towards red", and (2) is better for talking about the hypothesis "this spinner
is biased". Obviously, under the null hypothesis, the p-value calculated for
(2) will be twice as large as the p-value calculated for (1), which will have
implications for the publishability of your results, so you should choose (1)
and, if necessary, adjust your hypothesis to suit. ;)

Anyway, in principle, you add up the odds of all the outcomes that count as
"at least as extreme" as the result you observed, and that total is your
p-value. You can report "under the null hypothesis, there is less than a 5%
chance of observing this result" (p < 0.05), and the person after you in the
chain can report "it is at least 95% likely that the spinner is secretly
Republican", and the press can handle things from there.

Now, you've got a very interesting example going on in your question, because
your null hypothesis specifies that the height distribution of children in a
county is due to the local gravity. (Is this a reasonable null hypothesis?
No.) So if you do your survey and find that Kent kids are taller, and the
difference in mean height is significant at the p < 0.03 level, you can say
"we reject the null hypothesis with 97% confidence". The predictive part of
your null hypothesis says "the height distributions are identical in every
county", and that's pretty easy to calculate about. Your rejection is
perfectly justifiable! But, a reporter could then look at the phrasing of the
non-predictive part of your hypothesis and report "97% chance of a
gravitational anomaly in Kent", which would be unjustifiable; if Kent kids are
taller, I'd look at factors like nutrition and genetics long before I even
considered the possibility of a gravitational anomaly. The lesson here is that
rejecting the null hypothesis "there is no gravitational anomaly and therefore
the height distributions of children in every county will be identical"
doesn't allow you to conclude "what do you know, there is a gravitational
anomaly after all!", even though the rejection was fairly legitimate.
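
The arithmetic behind "significant at the p < 0.03 level" would typically be
something like this sketch (a two-sample t-test on simulated, not real, height
data), and note that it only tests the predictive part of the hypothesis,
exactly as described:

    # Sketch: comparing mean heights in two counties with a two-sample t-test.
    # Heights (cm) are simulated for illustration, not real survey data.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    kent  = rng.normal(loc=141.0, scale=7.0, size=1000)  # hypothetical Kent sample
    other = rng.normal(loc=140.0, scale=7.0, size=1000)  # hypothetical other county

    t, p = stats.ttest_ind(kent, other)
    print(f"t = {t:.2f}, p = {p:.4f}")
    # A small p says only "means this different are unlikely if the
    # distributions were identical" -- it says nothing whatsoever about gravity.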

~~~
lifeisstillgood
So the nice diagram with a small area beyond the 2nd standard deviation filled
in is showing all the results of throwing a die that are at least as extreme as
the observed result.

So if I throw a die 60 times I would expect a fair die to come up with 10 1's,
10 2's, etc. If it came up with 50 1's, I could say that I reject the fair-die
hypothesis with a confidence of

(Thinking)
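
For what it's worth, the same tail-sum idea finishes that calculation as a
sketch, assuming each roll of a fair die shows a 1 with probability 1/6:

    # Sketch: probability of 50 or more 1's in 60 rolls of a fair die.
    from math import comb

    n, k, p = 60, 50, 1 / 6
    p_value = sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k, n + 1))
    print(p_value)   # vanishingly small, so the fair-die hypothesis is rejected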

