
Don't calculate p-values without understanding them - fbeeper
http://www.scottbot.net/HIAL/?p=24697
======
btilly
It is great to understand this stuff. But people don't. And I personally
believe that p-values are popular exactly because they are so easy to
misunderstand as the answer to the question we actually want to ask (what is
the probability that we are right?).

That said, if you need to use p-values in A/B testing, you might want to read
<http://elem.com/~btilly/ab-testing-multiple-looks/part1-rigorous.html> to get
a procedure that gives always-valid p-values, and then
<http://elem.com/~btilly/ab-testing-multiple-looks/part2-limited-data.html>
for a more practical alternative and the caveats. (I still intend to return to
the series, but not for a bit more.)

~~~
msellout
No such thing as an always-valid p-value. Observations that were predictive
yesterday may have no value tomorrow.

~~~
btilly
The usefulness of a statistical answer always, of course, depends on the
validity of the statistical model used to generate said statistics. But that
particular statistical model allows unlimited looks at the data, and unlimited
stopping opportunities, without ruining the p-value.

------
glaugh
My biggest pet peeve is the interpretation of p-values absent the context of
effect sizes. If you have a huge sample size you're quite frequently going to
find statistically significant differences between groups, but often those
differences aren't that meaningful.

If you assessed differences in average human height between two cities based
on a million datapoints, you're pretty darn likely to find a difference that's
statistically significant but not really important or meaningful (like a .1mm
difference in average height).

The above is why the American Psychological Association says "reporting and
interpreting effect sizes in the context of [p-values] is essential to good
research".[1] It's also why Statwing always report effect sizes along with
p-values for hypothesis tests.[2]

(For clarity: effect size is best presented in readily interpretable, concrete
terms like height or whatever unit you're using. If that's not possible
because you're comparing ratings on a 1 to 7 scale or something, or if you
want to compare effect sizes across different types of analyses, there are
standardized effect-size metrics for that, like Cohen's d.)
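
A minimal sketch of the height example, with every number invented: hold a
tiny true difference fixed and the p-value can be driven as low as you like
just by collecting more data, while Cohen's d never stops being negligible
(exactly where p crosses 0.05 depends on the spread you assume):

    import math
    from scipy import stats

    diff_mm = 0.1    # assumed true difference in mean height between the cities
    sd_mm = 70.0     # assumed within-city standard deviation of height
    cohens_d = diff_mm / sd_mm   # ~0.0014: a negligible effect size

    for n_per_city in (1_000, 1_000_000, 100_000_000):
        se = sd_mm * math.sqrt(2.0 / n_per_city)   # standard error of the difference
        z = diff_mm / se
        p = 2 * stats.norm.sf(z)                   # two-sided p-value
        print(f"n per city = {n_per_city:>11,}   p = {p:.3g}   d = {cohens_d:.4f}")
    # The p-value eventually crosses 0.05 (and keeps falling), but d never moves.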

[1] <http://people.cehd.tamu.edu/~bthompson/apaeffec.htm>

[2] <https://www.statwing.com/demo>

~~~
btilly
The effect sizes that you're talking about are the result of hidden
correlations causing the random variation to be larger than the statistical
model thought possible. On the one hand you can always argue that this is a
sign you did statistics wrong. On the other hand the internal correlations
caused by ethnicity, diet, environmental pollution, etc. truly are hard to
estimate.

Another common cause of finding significant p-values when you shouldn't is
looking at many metrics (or the same metric at many points of time).
Eventually you get (un)lucky.

However, if you do the stats correctly, and the statistical model (which
usually assumes independence) is accurate, then a p-value under 5% will only
happen 5% of the time when there is no real effect, no matter how big the
sample size is.
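
A quick way to check that claim: simulate experiments in which there is no
real effect (both groups drawn from the same distribution) and count how often
p < 0.05. The sample sizes and experiment counts below are arbitrary:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    def false_positive_rate(n_per_group, n_experiments=2_000):
        hits = 0
        for _ in range(n_experiments):
            a = rng.normal(size=n_per_group)
            b = rng.normal(size=n_per_group)   # same distribution: no real effect
            if stats.ttest_ind(a, b)[1] < 0.05:
                hits += 1
        return hits / n_experiments

    for n in (20, 200, 5_000):
        print(f"n = {n:>5}: fraction of p < 0.05 = {false_positive_rate(n):.3f}")   # all ~0.05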

~~~
glaugh
Yeah, my point isn't that the statistical significance is erroneous or that
the difference shouldn't have been significant.

In my totally made-up example, there's really a .1mm difference. And maybe to
someone that's meaningful, so it's worth reporting. But it's not a big enough
effect to say "When I see someone on the street I can look at their height and
know which of these two cities they're from."

And my peeve is that people will see a statistically significant difference
and think that that alone makes it an important finding, when it may be a very
real finding that doesn't actually matter that much.

Or, related, it's bad to see something with a p-value of .15 and a large
effect size and decide that since it's not significant at .05 it's not
interesting. Since that combination typically means that you have a too-small
sample size, the best interpretation is probably more like "This might be
interesting. Looks like we don't have enough data to tell if there's a
relationship here, but it probably deserves another look."

(Edited to add that last thought, and clarify the previous one)

------
mrgoldenbrown
Obligatory xkcd demonstrating p-value shenanigans: <http://xkcd.com/882/>
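
In simulation form (setup invented for illustration): run 20 null comparisons,
one per jelly bean color, and the chance that at least one comes out
"significant" is roughly 1 − 0.95^20 ≈ 64%:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_runs, n_colors, n_per_group = 2_000, 20, 50

    false_alarms = 0
    for _ in range(n_runs):
        # 20 independent null comparisons, one per jelly bean color
        pvals = [stats.ttest_ind(rng.normal(size=n_per_group),
                                 rng.normal(size=n_per_group))[1]
                 for _ in range(n_colors)]
        if min(pvals) < 0.05:
            false_alarms += 1

    print(f"runs with at least one 'significant' color: {false_alarms / n_runs:.2f}")  # ~0.64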

------
tedsanders
Great topic. P-values are so easy to misinterpret, in fact, that I think the
article makes the very error that it warns against:

>"If it’s under 5%, p < 0.05, we can be reasonably certain that our result
probably implies a stacked coin."

By itself, a p-value is NOT enough to imply that the null hypothesis is false.
In fact, if I flipped a regular-looking coin and saw 7 heads, I'd still be
very confident that the coin is fair, because weighted coins are so rare.
Later, the article correctly warns:

>P-value misconception #5: "1 − (p-value) is not the probability of the
alternative hypothesis being true (see (1))."
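
To put rough numbers on the coin example above (the 1-in-1,000 prior on
weighted coins and the 90%-heads bias are invented; take "7 heads" to mean 7
heads in 7 flips):

    # Prior and bias below are invented; assume 7 heads in 7 flips.
    p_weighted_prior = 1 / 1000      # how rare weighted coins are
    p_heads_if_weighted = 0.90       # how biased a weighted coin is

    p_data_if_fair = 0.5 ** 7                      # P(7 heads | fair)
    p_data_if_weighted = p_heads_if_weighted ** 7  # P(7 heads | weighted)

    posterior_fair = (
        (1 - p_weighted_prior) * p_data_if_fair
        / ((1 - p_weighted_prior) * p_data_if_fair
           + p_weighted_prior * p_data_if_weighted)
    )
    print(f"P(fair | 7 heads) = {posterior_fair:.2f}")
    # ~0.94: the p-value against "fair" is under 0.05, yet fair remains by far
    # the better bet, because weighted coins were so rare to begin with.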

P.S. I think the weasel words in the first quoted sentence, "reasonably
certain" and "probably implies," show that the author is at least
subconsciously aware of this logical error. :)

------
tokenadult
A really good new site about "p-hacking" and how to detect it

<http://www.p-curve.com/>

is by Uri Simonsohn, a professor of psychology with a better-than-average
understanding of statistics, and colleagues who are concerned about making
scientific papers more reliable. You can use the p-curve software on that site
for your own investigations into p values found in published research.

Many of the issues brought up by the blog post kindly submitted here, and by
the comments posted before this one, become much clearer after reading
Simonsohn's various articles

<http://opim.wharton.upenn.edu/~uws/>

about p values and what they mean, and other aspects of interpreting published
scientific research.

------
dbecker
The common misconceptions of p-values make them appear more relevant than they
really are.

I'm skeptical they would be used nearly as much if they were properly
understood.

------
pseut
This line is silly:

> _you can just pretend you went into your experiment with different halting
> conditions and, voila!, your results become significant._

You can misrepresent your results regardless of the underlying statistic. But
it's no easier to lie about p-values than to lie about any other statistical
procedure.

Anyway, the post seems to be more about hypothesis testing than p-values per
se.

~~~
scottbot
I agree, lying is bad regardless! My post isn't anti p-values, it's anti
poorly-understood-or-performed-statistics. NHST just happens to be the subject
of choice, because it's particularly misunderstood. Anyway, I'm less worried
about lying, and more worried about accidental inaccuracies, e.g., someone
collecting data until they get tired or run out of funding, but running the
calculation as though the "intent" was to get exactly that number of
observations.

~~~
pseut
I don't know... this line towards the end suggests you're anti-testing:

"It is for this reason that I’m trying desperately to get quantitative
humanists using non-parametric and Bayesian methods from the very beginning,
before our methodology becomes canonized and set."

:)

The "the design of the experiment shouldn't matter so much" assertion is
usually followed by an appeal to the likelihood principle and a claim that
frequentist estimation is misguided. If that's not what you had in mind,
apologies. I've never seen it coupled with a claim that the frequentist would
then misrepresent their experiment...

If the quoted statement is your goal, I think a more convincing argument is,
"we often want a more nuanced way to express uncertainty than classical
tests/confidence intervals give us, and... LOOK! We get that for free using
Bayesian principles."

As an aside, running out of funding seems like it should give you the same
results as a predetermined sample size, as long as the funding isn't
conditional on getting interesting (i.e. statistically significant) results,
but I'd need to actually do the math to be certain.

~~~
scottbot
My stance is much softer than that, but I should have made it clearer, because
similar arguments are often made in anti-frequentist rants. I think that there
is an appropriate place for most statistics used (including NHST under the
right circumstances) - and often, the differences between the results are
entirely negligible.

My goal is to make people aware of the various stats out there, their benefits
and pitfalls, and let people choose whatever is the most appropriate for their
needs. Those choices need to be informed, and given that most introductory
stats starts with p-values and seems to teach them wrong, that's where this
post is aimed.

Regarding your aside, the universe of possible observations in a given
experiment assuming 100 trials may be very different from the universe
assuming trials continue until we run out of money, which happened to fall at
100.

~~~
sesqu
If you want to go down that route, I suggest examining the assumptions of
so-called nonparametric tests and uninformative priors as future topics.

