
Statisticians Find They Can Agree: It’s Time to Stop Misusing P-Values - tokenadult
https://fivethirtyeight.com/features/statisticians-found-one-thing-they-can-agree-on-its-time-to-stop-misusing-p-values/
======
tokenadult
The article submitted here leads to the American Statistical Association
statement on the meaning of p values,[1] the first such methodological
statement ever formally issued by the association. It's free to read and
download. The statement is summarized in these main points, with further
explanation in the text of the statement.

"What is a p-value?

"Informally, a p-value is the probability under a specified statistical model
that a statistical summary of the data (for example, the sample mean
difference between two compared groups) would be equal to or more extreme than
its observed value.

"Principles

"1\. P-values can indicate how incompatible the data are with a specified
statistical model.

"2\. P-values do not measure the probability that the studied hypothesis is
true, or the probability that the data were produced by random chance alone.

"3\. Scientific conclusions and business or policy decisions should not be
based only on whether a p-value passes a specific threshold.

"4\. Proper inference requires full reporting and transparency.

"5\. A p-value, or statistical significance, does not measure the size of an
effect or the importance of a result.

"6\. By itself, a p-value does not provide a good measure of evidence
regarding a model or hypothesis."

[1] "The ASA's statement on p-values: context, process, and purpose"

[http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016....](http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108)

~~~
wdewind
> "3\. Scientific conclusions and business or policy decisions should not be
> based only on whether a p-value passes a specific threshold.

There is no lower threshold at which the data becomes non-predictive?

~~~
carbocation
More likely they mean the opposite: that a statistically significant P value
(by whatever threshold you decide to use) should not be used, by itself, to
drive policy decisions. Internally, the effect size still matters. Externally,
there are numerous other factors that should drive decisionmaking.

~~~
wdewind
Right, but I guess what I'm getting at is frequently we see people doing even
worse: making policy decisions based on data that doesn't even hit a minimum
threshold for acceptability. I totally agree that people shouldn't make policy
decisions solely because the data supports it, but frequently you see people
make decisions based on data _indicating_ something without actually being
statistically significant. That feels like a bigger problem to me than people
actually getting good data but then using it too confidently, which I have
rarely if ever seen.

TLDR: People tend to make policy decisions based on data even when the data is
basically useless; I'm more worried about that than about blindly following
good data.

~~~
skybrian
One issue is that if you have a large effect that's consistently and easily
reproduced, you don't actually need very accurate measurements or a
statistical analysis at all. So any minimum standard would need to take that
into account.

Another issue is that science is expensive and we need to make decisions all
the time whether there is any science backing them or not. So what do you do
if there's no science that meets the minimum standards?

~~~
wdewind
> One issue is that if you have a large effect that's consistently and easily
> reproduced, you don't actually need very accurate measurements or a
> statistical analysis at all.

I mostly agree with this, but you never have that in a difficult decision.

> Another issue is that science is expensive and we need to make decisions all
> the time whether there is any science backing them or not. So what do you do
> if there's no science that meets the minimum standards?

Right! My entire point is that maybe data driven decision making isn't as
useful as we think because it mostly doesn't hit scientific levels of
accuracy. Maybe there are other, more intuitive systems that make as much if
not more sense in the absence of good data. How would we even know either way? We
can't!

Regardless I'd love for the discussion to be "this is the best we can do"
rather than "yes we did A/B testing so we _know_ it's true!!!"

~~~
thaumasiotes
>> One issue is that if you have a large effect that's consistently and easily
reproduced, you don't actually need very accurate measurements or a
statistical analysis at all.

> I mostly agree with this, but you never have that in a difficult decision.

Consider the problem "should we cure congenital deafness in infants?"

To Deaf Community Leaders, this isn't a difficult decision at all; they are
strongly against curing deaf children because it makes their power base
smaller.

To deaf parents of deaf children, this is a tricky choice. Curing the child's
deafness dramatically improves its prospects in life, but it also inevitably
cuts the child off from the parents' community. You have to choose between how
good you want your child's future to be, and how close you'd like your
relationship with them to be.

To hearing parents of deaf children, this is again a trivial choice; obviously
you'd cure the child.

BUT, curing deafness is definitely (1) a large effect that is (2) easily
reproduced.

~~~
thaumasiotes
Having thought a bit more about this, I want to disagree more strongly with
the sentiment "you never have [large, reproducible effects] in a difficult
decision". I think difficult decisions necessarily involve that kind of
effect.

A couple of things might make a decision difficult:

- Some course of action will produce a big effect. Would that be good or bad?

-- Ok, assume there's a good effect out there to be achieved. Is it worth the
cost of obtaining it?

Those questions, variously applied, occupy a lot of people's time and
brainpower. In the curing-a-deaf-child example, the parents are making a
tradeoff between quality of their child's life (which is good), and closeness
to their child (also good). Paying one for the other means giving up something
good, which makes the decision nontrivial. But this is a difficult decision
because the effects are _large_, not because they're small.

In contrast, if you're dealing with an effect of very small size, or one that
can't be reproduced ("lose weight by following our new diet!")... this might
feel like a difficult decision, but it shouldn't. It _does not matter_ what
you decide, because (by hypothesis!) your decision will have no effect on the
outcome! (Or, for the "very small effect size" case, at most a very small
effect.)

------
btilly
I agree as well!

Here is what probability theory teaches us. The proper role of data is to
adjust our prior beliefs about probabilities to posterior beliefs through
Bayes' theorem. The challenge is how to best communicate this result to people
who may have had a wide range of prior beliefs.

p-values capture a degree of surprise in the result. Naively, a surprising
result should catch our attention and cause us to rethink things. This is not
a valid statistical procedure, but it IS how we naively think. And the
substitution of a complex question for a simpler one is exactly how our brains
are set up to handle complex questions about our environment. (I'm currently
working through _Thinking Fast and Slow_ which has a lot to say about this.)

Simple Bayesian approaches take the opposite approach. You generally start
with some relatively naive prior, and then treat the posterior as being the
conclusion. Which is not very realistic if the real prior was something quite
different.

Both approaches have a fundamental mistake. The mistake is that we are taking
a data set and asking what it TELLS us about the world. When in probability
theory the real role of data is how to UPDATE our views about the world.

This is why I have come to believe that for simple A/B testing, thinking about
p-values is a mistake. The only three pieces of information that you need are
how much data you are willing to collect, how much you have collected, and how
big the performance difference is. Stop either when you have hit the maximum
amount of data you're willing to throw at the test, or when the difference
exceeds the square root of that maximum amount. This is about as good as any
simple rule can do.
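A minimal sketch of how I read that rule, in Python (the function and variable
names are my own, and treating "difference" as the raw difference in successes
per arm is an assumption, not part of the original description):

    import math

    def should_stop(successes_a, successes_b, n_per_arm, max_n_per_arm):
        """Stop the A/B test when we've collected the maximum data we're
        willing to, or when the raw difference in successes between the
        two arms exceeds the square root of that maximum."""
        if n_per_arm >= max_n_per_arm:
            return True
        return abs(successes_a - successes_b) > math.sqrt(max_n_per_arm)

    # e.g. willing to send at most 100,000 visitors to each variant:
    should_stop(730, 1080, n_per_arm=40000, max_n_per_arm=100000)
    # True: |730 - 1080| = 350 > sqrt(100000) ~= 316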

If you try to be clever with p-values, you will generally wind up saving
yourself some effort in return for a small risk per test of making very bad
mistakes. Accepting a small risk per test, over many tests, for a long time
puts you at high odds of eventually making a catastrophic mistake. This is a
very bad tradeoff.

I've personally seen a bad A/B test with a low p-value rolled out that
produced a 15% loss in business for a company whose revenues were in the tens
of millions annually. It. Was. Not. Pretty. (The problem was eventually found
and fixed... a year later and after considerable turnover among the executive
team.)

~~~
yummyfajitas
_Simple Bayesian approaches take the opposite approach. You generally start
with some relatively naive prior, and then treat the posterior as being the
conclusion. Which is not very realistic if the real prior was something quite
different._

I don't think this is a completely accurate portrayal of Bayesian stats. In
Bayesian stats, there is no "real prior". Probability distributions are all
subjective representations of belief. The prior is just what you believe
_prior_ to evidence, and the posterior is what you believe after you've taken
evidence into account.

That said, moving away from p-values and towards something more robust is
something the A/B testing industry needs. (Obviously I have my own opinion of
what that something should be, and it's a bit different from what you are
advocating.) There are far too many consultancies and agencies p-hacking their
way to positive results ("hey unsophisticated client - guess what I made your
conversion rate go up 25%!") and I'd love to see every one of them die.

~~~
btilly
There are different approaches a Bayesian might take. The one that I described
is certainly among them, though it is not the only one.

~~~
achompas
I think the word "naive" is problematic here. Have you seen instances where
Bayesians choose a prior that isn't at least somewhat informed by exploratory
analysis?

~~~
mturmon
It's quite common to choose a conjugate prior, which aids computation, and
which is readily comprehensible within the discipline ("...and a Wishart prior
for the covariance, of course..."). But which is, in effect, not at all
informed by the data.

It's also common to have a complex, multi-level model setup which has a few
hyperparameters (say, gamma distribution shapes) which are set basically
arbitrarily. The idea being that the posteriors for the lowest levels (closest
to data) will be learned/fitted, but the hyper parameters at the top-level of
the model are fixed. This is very common in spatial statistics.

It's also common to make lots of (conditional) independence assumptions, just
because they are convenient. Such as diagonal covariances, or conditional
independence between "separate" elements of a model. But these independence
assumptions are often just convenient or based on intuition, but not re-
checked. Or, if they are checked and found wanting, it's "left for future
work".

These practices are defensible, and attackable. But choosing convenient priors
out of a bag of standard priors, without reference to the problem at hand, is
very common.

------
johan_larson
It's not just p-values. Some people just don't understand even very basic
statistics.

I remember talking to one person in marketing who ran surveys of the company's
users. They would send out a survey to all registered users, get back
responses from 1% of them or something, and then proceed to report findings
based on the responses. They were really happy, since a 1% response rate is
great for surveys like this.

I tried to explain to them that all of this statistical machinery relies on
having a random sample, and a self-selected sample is not that. No effect
whatsoever. Surveys like this are standard practice in the industry. Why are
you making trouble, geek-boy?

~~~
roel_v
" Surveys like this are standard practice in the industry."

So, how does one exploit this apparent bad practice? I.e., how does one make a
profit from others making this mistake? The answer is, of course, one doesn't
- otherwise others would've done so. So what does this tell us about the state
of affairs? Is it that bad sampling doesn't matter for practical purposes, or
that marketing research is useless? I don't know, and I'm not trying to be
belligerent here. A situation similar to this shows up in dozens of places
every day - more often than not, it doesn't matter if things are done 'right',
there are large margins within which 'good enough' is indistinguishable from
'right'.

This bothers me greatly, but it's hard to argue against this conclusion,
empirically. How to deal with this cognitive dissonance? I mean, this is
fundamentally the exact topic that at least half of the blog posts that make it
to the HN front page are about.

------
Houshalter
P-values are so weird. Studies should instead report a likelihood ratio. A
likelihood ratio is mathematically correct, and tells you exactly how much to
update a hypothesis.

You can convert p-values to likelihood ratios, and they are quite similar. But
it's not perfect. A p-value of 0.05 becomes 100:5, or 20:1, which means it
increases the odds of a hypothesis by a factor of 20. So a probability of 1%
updates to 17%, which is still quite small.

But that assumes that the hypothesis has a 100% chance of producing the same
or greater result, which is unlikely. Instead it might only be 50%, which is
half as much evidence.

In the extreme case, it could be only 5% likely to produce the result, which
means the likelihood ratio is 5:5 (i.e. 1:1) and is literally no evidence, but
still has a p-value of 0.05.

Anyway likelihood ratios accumulate exponentially, since they multiply
together. As long as there is no publication bias, you can take a few weak
studies and produce a single very strong likelihood update.
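The odds arithmetic above is easy to check with a few lines of Python (a
sketch; the numbers are just the ones used in this comment):

    def update(prior_prob, likelihood_ratio):
        """Convert a prior probability to odds, multiply by the likelihood
        ratio, and convert the posterior odds back to a probability."""
        prior_odds = prior_prob / (1 - prior_prob)
        posterior_odds = prior_odds * likelihood_ratio
        return posterior_odds / (1 + posterior_odds)

    update(0.01, 20)  # ~0.17: a 1% hypothesis updates to roughly 17%
    update(0.01, 10)  # ~0.09: half as much evidence (alternative only 50% likely)
    update(0.01, 1)   # 0.01: likelihood ratio 5:5 = 1, literally no evidence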

------
chimeracoder
For context, I have a degree in statistics, and I did research with Andrew
Gelman (one of the statisticians quoted in the article).

Glad to see this is gaining traction! I've been saying this for years: the
world would actually be in a better place if we just abandoned p-values
altogether.

Hypothesis testing is taught in introductory statistics courses because the
calculations involved are deceptively easy, whereas more sophisticated
statistical techniques would be difficult without any background in linear
algebra or calculus.

Unfortunately, this enables researchers in all sorts of fields to produce
incredibly spurious statistical analyses that _look_ convincing, because "all
the calculations are right", even though they're using completely the wrong
tool.

Andrew Gelman, quoted in the article, feels very strongly that F-tests are
always unnecessary[0]. I'd go as far as to extend that logic to the Student's
t-test and any other related test as well.

You can get into all sorts of confusing "paradoxes" with p-values. One of my
favorites:

Alice wants to figure out the average height of a population. Her null
hypothesis is 65 inches. She conducts a simple random sample, performs a
t-test, and determines that the sample mean is 70 inches, with a p-value of
.01.

In an alternate universe, Bob does the same thing, with the same null
hypothesis (65 inches). He determines that the sample mean is 90 inches, with
a p-value of .000001.

Some questions:

A) Does Bob's experiment provide stronger evidence for rejecting the null
hypothesis than Alice's does?

B) In Bob's universe, is the _true_ population mean higher than it is in
Alice's universe?

By pure hypothesis testing alone, the correct answer to both questions is
"no", even though the intuitive answer to both questions is "yes"[1].

[0]
[http://andrewgelman.com/2009/05/18/noooooooooooooo/](http://andrewgelman.com/2009/05/18/noooooooooooooo/)

[1] Part of the problem is that we _do_ expect that, in Bob's universe, the
true population mean is highly likely to be higher, and this _is_ supported by
the data. Trouble is, the reason we expect that is not formally related to
hypothesis testing and t-tests/p-values.
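For concreteness, here's roughly how those p-values fall out of a one-sample
t-test computed from summary statistics (the sample size and standard
deviation below are made up, since the thought experiment doesn't specify
them):

    from math import sqrt
    from scipy import stats

    def one_sample_t(sample_mean, sample_sd, n, mu0):
        """Two-sided one-sample t-test from summary statistics."""
        t = (sample_mean - mu0) / (sample_sd / sqrt(n))
        p = 2 * stats.t.sf(abs(t), df=n - 1)
        return t, p

    one_sample_t(70, 12, n=40, mu0=65)  # Alice-like: a modest p-value
    one_sample_t(90, 12, n=40, mu0=65)  # Bob-like: a vanishingly small p-value

Either way, the only formal verdict hypothesis testing hands back is "reject
the null at the chosen threshold", which is why the answer to both questions
above is "no".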

~~~
eanzenberg
If you're doing a measurement, why have a null hypothesis? Alice should sample
the population at random, take the height measurements, calculate the average,
plot the distribution, calculate the variance. If the distribution is not
sufficiently smooth then continue to take measurements until it's smooth, or
unchanging. Then Alice is done discovering all there is to know about the
height distribution of the population she sampled. Same with Bob.

~~~
chimeracoder
> Alice should sample the population at random, take the height measurements,
> calculate the average, plot the distribution, calculate the variance

A Student's t-test is basically a lossy encoding of the information you
describe (the sample mean, the variance, and the sample size).

> If the distribution is not sufficiently smooth then continue to take
> measurements until it's smooth, or unchanging.

From a frequentist standpoint, this would be considered sloppy and bad
methodology. You don't keep sampling until you get the results you want (or
until the null hypothesis is invalidated).

That said, as I mention in a comment above, the fact that you _can_ keep
sampling until the null hypothesis is invalidated (and that it is
mathematically always guaranteed to happen eventually) is a big problem with
the concept of hypothesis testing in the first place.
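A quick simulation makes this concrete (a sketch, assuming a fair coin, i.e.
the null really is true, and a two-sided normal-approximation test after every
flip):

    import math, random

    def flips_until_significant(alpha=0.05, max_flips=100000, seed=None):
        """Flip a fair coin, re-testing after every flip (optional stopping),
        and return how many flips it took for p to dip below alpha."""
        rng = random.Random(seed)
        heads = 0
        for n in range(1, max_flips + 1):
            heads += rng.random() < 0.5
            if n < 30:
                continue  # let the normal approximation become reasonable
            z = (heads - n / 2) / math.sqrt(n / 4)
            p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
            if p < alpha:
                return n
        return None  # didn't cross within max_flips (given enough flips, it will)

Run it a few times: even though the coin is fair, the running p-value
frequently wanders below 0.05 at some point, which is exactly why "keep
sampling until significant" is not a valid procedure.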

------
darawk
Is the p-value really not the probability of your results being due to chance?
Is that not a perfectly valid definition of it?

I suppose 'chance' is a little hand-wavy, but isn't a p-value just the
probability of your data given that your hypothesis is false? Isn't that
literally and precisely the probability that they occurred by chance?

~~~
haberman
Imagine I handed you a 20-sided die. I claim it says 7 on every side, but I
might be lying. You roll a 7. What are the chances it actually has 7 on every
side?

You can't actually say unless you either (1) roll the die more times, or (2)
assume something about the probability that I gave you an all-7's die to begin
with.

Doing (2) is useless, because that's exactly the question we are trying to
answer.

For example, suppose I perform this experiment all the time and I know that I
give an all-7's die only 1% of the time. With this new information, you could
actually calculate the probability of an all-7's die given a 7 roll. You could
add up the outcomes where I gave you an all-7's die (and therefore rolled a 7),
then divide that by all of the outcomes in which a 7 comes up: those, plus the
ones where I gave you a normal die but you just happened to roll a 7.

But this would give you a totally different number than if I give you an
all-7's die 99% of the time. And the problem is that you don't have any
information about what kind of die you have before you roll it. You're trying
to figure out which world we live in -- one where your hypothesis is true or
one where it's not.

(I am pretty sure that what I wrote above is true. But one thing I'm not as
clear on is how multiple rolls of the die actually can establish confidence
percentages. How many rolls does it take to actually establish confidence?
Would love to hear from any stats experts about that.)

~~~
aidenn0
Rolling the die more times just lowers the p-value. You still can't make a
definitive statement about the probability that it's an all-7's die without
assuming something about the prior probability.

However, if you can bound the prior probability on the low end, you can give
meaningful answers. Let's say you think there's at least a 1-in-a-billion
chance that you have an all-7's die. After five rolls in a row of 7, there's
at least a 0.3% chance that you were handed an all-7's die. After six rolls
there's a 6% chance, and so on.
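Those numbers fall straight out of Bayes' rule. A quick check (assuming the
only alternative to the all-7's die is a fair 20-sided die):

    def p_all_sevens(prior, k):
        """Posterior probability of the all-7's die after k consecutive 7s."""
        like_fair = (1 / 20) ** k  # a fair d20 shows 7 every time with prob (1/20)^k
        like_all7 = 1.0            # the all-7's die shows 7 every time
        return prior * like_all7 / (prior * like_all7 + (1 - prior) * like_fair)

    p_all_sevens(1e-9, 5)  # ~0.003, i.e. the "at least a 0.3% chance" above
    p_all_sevens(1e-9, 6)  # ~0.06, roughly the 6% figure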

Usually this quantitative analysis isn't formally done, since the priors can
always be debated, but rather a very small P value is demanded for very
unlikely events.

------
FrankyHollywood
Next step is explaining to social science students the meaning of 'randomness' :)

Really, the amount of bullshit social studies I have seen 'proven' by
statistics is amazing. Amazing new insights like 'if children wear green shirts
while the teacher has a blue shirt, the cognitive attention span is 12.3% higher
than for children wearing purple shirts. The effect was measured with a
significance of bla bla bla.'

Software like SPSS facilitates this even more. People with no notion of random
effects or probability theory click on the 'prove my research' button and even
get it published.

So there's a lot more work to do in this area!

------
rcthompson
In undergrad, I learned about p-values but never quite understood how
they were actually useful. Now, as a bioinformatics graduate student, I've
come to understand that my original instinct was right all along.

------
carbocation
Statistical significance is difficult to ensure. Certainly, one should be
suspicious if 0.05 is _ever_ used as a significance threshold, because it's
unlikely that exactly one hypothesis under one regime was tested in any given
paper.

I am glad that the article's headline is clear that it's time to stop
_mis_using P-values. Tests for statistical significance should still be used, and
to abandon them would be foolish. In a sense, though, they are the beginning,
not the end, of assessment.

------
eanzenberg
p-value analysis has its big caveats like multiple comparisons, but Bayesian
analysis has its own, such as it's extremely hard to calculate priors. Both are
challenging to use in difficult analyses and both can be abused.

~~~
arcanus
> such as it's extremely hard to calculate priors.

I think you mean it is difficult to formulate priors? Typically, calculating a
prior only involves sampling from a distribution.

~~~
eanzenberg
Let's use this example: [http://betterexplained.com/articles/an-intuitive-and-
short-e...](http://betterexplained.com/articles/an-intuitive-and-short-
explanation-of-bayes-theorem/)

- 1% of women have breast cancer (and therefore 99% do not).

- 80% of mammograms detect breast cancer when it is there (and therefore 20% miss it).

- 9.6% of mammograms detect breast cancer when it’s not there (and therefore 90.4% correctly return a negative result).

For generating a prior, I posit it's really hard to determine that 1% of women
have breast cancer, and this prior is almost linearly sensitive to whether you
actually have cancer:

prior | % likelihood of cancer
2%    | 14.5%
1%    | 7.8%
0.5%  | 4%

Why do I think it's hard to determine the prior? Imagine you are a 34-year-old
woman in SF. Do you use the rate of breast cancer for only 34 yr olds? For the
range of 30-40 yr olds? 30-65 yr olds? For women only in SF? In the bay area?
In California? Data from 2000-2016? 2010-2016? This becomes hard because
positive cases are rare (thank god), so the population from which you calculate
the prior can be noisy. Imagine if the positive result occurs 1e-5 of the
time... you are talking about tens of cases per million. Vary the prior
probability and your posterior probability will vary just as much.
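Plugging those numbers into Bayes' theorem shows exactly that sensitivity (a
short sketch, using the detection and false-positive rates quoted above):

    def p_cancer_given_positive(prior, sensitivity=0.80, false_pos_rate=0.096):
        """P(cancer | positive mammogram) via Bayes' theorem."""
        true_pos = prior * sensitivity
        false_pos = (1 - prior) * false_pos_rate
        return true_pos / (true_pos + false_pos)

    for prior in (0.02, 0.01, 0.005):
        print(prior, round(p_cancer_given_positive(prior), 3))
    # 0.02 0.145
    # 0.01 0.078
    # 0.005 0.04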

~~~
arcanus
hmm, what you are discussing above is typically called the formulation of the
prior, not the calculation of it. And I agree that this formulation can be
problematic. It is almost certainly the most contentious element in Bayesian
inference.

However, I will note that your examples are largely overstating the problem.
The prior is not typically as subjective as you are implying. Furthermore, the
Bayesian prediction converges to the frequentist result in the limit of
'large' data. If your results are so highly sensitive to your prior that the
results drastically change, then you either have:

1) insufficient data, at which point the frequentist results would also
certainly have been as bad or worse (the prior acts as a regularization), or

2) a misspecified prior.

Either way, it is absolutely required that you perform tests to ensure your
results are _not_ greatly sensitive to your choice of prior, or if so that
this is clearly noted as a modeling assumption. And Bayesian inference naturally
has many means to test precisely this, such as cross-validation, Bayes factors,
hierarchical models, etc.

I also note that this is why I typically prefer Jeffreys Priors in my work. I
mention this so that one can see that not all priors are 'subjective'. These
are difficult to use in some fields, admittedly.

------
benjaminmhaley
The #1 problem with p-values is the word "significant". We should use
"detectable" instead. Significant implies meaningful to most people, but not
in a statistical context. This is quite confusing. Detectable is better
because the mainstream meaning aligns with the jargon.

So:

> "Discovering statistically significant biclusters in gene expression data"

becomes:

> "Discovering statistically detectable biclusters in gene expression data"

This rephrasing makes it evident that "statistically detectable" adds little
to the title. So the title becomes

> "Discovering biclusters in gene expression data"

A better title.

------
giardini
This excellent and amusing article by Gerd Gigerenzer discusses the history of
p-values and their (mis)use:

"Mindless Statistics"

[http://library.mpib-
berlin.mpg.de/ft/gg/GG_Mindless_2004.pdf](http://library.mpib-
berlin.mpg.de/ft/gg/GG_Mindless_2004.pdf)

or

[http://www.unh.edu/halelab/BIOL933/papers/2004_Gigerenzer_JS...](http://www.unh.edu/halelab/BIOL933/papers/2004_Gigerenzer_JSE.pdf)

------
altrego99
The article tends to imply the p-value should not be used at all, rather than
merely not misused. The p-value definitely means something. For example, if the
p-value is 1e-10 (which is often possible), you know for sure that the
hypothesis generating the data has been disproved. So let me rephrase the title
of the article - "It's time to use p-values correctly."

------
daodedickinson
Wow, I remember having these reservations about p-values when I took classes
in stats but whenever I brought them up a prof. would wave their hands and be
dismissive. They gave me a degree in political science, but I felt that
political science was an oxymoron and it left me with no respect for the
field.

~~~
chimeracoder
> They gave me a degree in political science, but I felt that political
> science was an oxymoron and it left me with no respect for the field.

For what it's worth, Andrew Gelman (quoted in the article) is one of the most
pre-eminent Bayesian statisticians alive, and is a professor in both the
department of Statistics and Political Science!

"Political Science" need not be an oxymoron, even if a lot of self-professed
political scientists use rather unscientific methods.

------
purpled_haze
...and I could not find the word "correlation" anywhere in this article
discussing statistics, harm, misunderstanding, and studies. Shame.

------
mikeskim
If people would just learn to resample (cross validate, use subsampling or the
bootstrap), we wouldn't be having this pointless discussion at all.
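For anyone unsure what that looks like in practice, here is a minimal
percentile-bootstrap sketch for the difference in means between two groups
(plain Python; the function name is made up):

    import random

    def bootstrap_mean_diff_ci(a, b, n_boot=10000, alpha=0.05, seed=0):
        """Percentile-bootstrap confidence interval for mean(a) - mean(b)."""
        rng = random.Random(seed)
        diffs = []
        for _ in range(n_boot):
            ra = [rng.choice(a) for _ in a]  # resample each group with replacement
            rb = [rng.choice(b) for _ in b]
            diffs.append(sum(ra) / len(ra) - sum(rb) / len(rb))
        diffs.sort()
        return diffs[int(n_boot * alpha / 2)], diffs[int(n_boot * (1 - alpha / 2)) - 1]

If the interval sits comfortably away from zero, you have learned roughly what
a p-value would have told you, plus an effect size and its uncertainty.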

~~~
vasilipupkin
In theory, you are correct, but some data sets are too small or make it too
difficult to cross-validate in a way that is meaningful to the original
problem.

------
ianamartin
This is one of the best conversation threads I've ever seen on HN. It's both
polite and informative.

I want to toss in my own thoughts here.

Since I've spent the vast majority of my tech career in the Market Research
industry (hello, bias!), I'm tempted to say that one of the most frequent
intersections between statistical science and business decisions happens in
that world.

Product testing, shopper marketing, A/B testing . . . these are pretty common
fare these days. But I feel like the MR people are sort of their own worst
enemy in many cases.

It's a fairly recent development that MR people are even allowed a seat at the
table for major product or business decisions. And when the data nerds show up
at the meeting, we have to make human communication decisions that are
difficult.

I can't show up at the C-suite and lecture company executives about the finer
points of statistical philosophy. When I'm presenting findings to
stakeholders, it's my job to abstract the details and present something that makes
a coherent case for a decision, based on the data we have available.

It is sinfully attractive to go tell your boss's boss's boss that we have a
threshold--a number we can point to. If this number turns out to be smaller
than .05, this project is a go.

Three months later, you go back to that boss and tell him the number came back
and it was .0499999. The boss says, "Okay, go!" And then you are all, "Wait,
wait, wait. Hang on a second. Let's talk about this."

My god, what have I done?

The practical reality of the intersection of statistics and business is a
harsh one. We have to do better. In terms of leaky abstractions, the
communication of data science to business decision makers is quite possibly
the leaky-est of all.

Why is it so leaky? I have two points about this.

1) Statistics is one of the most existentially depressing fields of study.
There is no acceptance; there is no love; there is nothing positive about it.
Ever.

Statistics is always about rejection and failure. We never accept or affirm a
hypothesis. We only ever reject the null hypothesis or we fail to reject it.
That's it.

2) In business, we tend to be very very sloppy about formulating our
hypotheses. Sometimes we don't even really think about them at all.

Take a common case for market research. New product testing. We do a rep
sample with a decent size (say, 1800 potential product buyers) and we randomly
show five different products, one of which is the product the person already
owns/uses (because that's called control /s). The other 4 products are
variations on a theme with different attributes.

What's the null hypothesis here? Does it ever get discussed?

What's the alternative hypothesis?

The implicit and never-talked-about null is that all things being equal, there
is no difference between the distribution of purchase likelihood among all
products. The alternative is that there is a real difference on a scale of
likely to purchase.

The implicit and intuitive assumption is that there is something about that
feature set that drives the difference. (I'm looking at you, Max Diff)

But that's not real. It's not a part of the test. The only test you can do in
that situation is to check if those aggregate distributions are different from
each other. The real null is that they are the same, and the alternative is
that they are different.

All you can do with statistics is tell if two distributions are isomorphic.

Now, who wants to try to explain any of that to your CEO? No one does. Your
CEO doesn't want it, you don't want it, your girlfriend doesn't want it. No
one wants it.

So we try to abstract, and I feel like we mostly fail at doing a good job of
that.

This is getting really long, and I don't want to rant. So to finish up, an
idea for more effective uses of data science as it interacts with the business
world:

I agree, let's stop talking about p values. Let's work harder and funnel the
results of those MR studies into practical models of the business' future.
Let's take the research and pipe it into Bayesian expected value models.

Let's stop showing stacked bar charts to execs and expecting them to make good
decisions based on weak evidence we got from hypotheses we didn't really think
about in the first place.

Some of this might come across as a rant. I hope it is not taken that way.
This is a real problem that I've been thinking about for a long time. And I
don't mean to step on anyone's toes. I have certainly committed many of the
data sins that I'm deriding above.

Edited to add:

The real workings of statistics are unintuitive. I'm not saying that they are
wrong. But in working with people for years now, I understand the confusion.
It's a psychological problem. Hypotheses are either not really well thought out
or not considered in an organized way, in my experience.

A hypothesis is not concrete in many practical cases. It's a thought. An idea,
perhaps. It's often a thing that floats around in your mind, or maybe you paid
some lip service and tossed it into your note-taking app.

Data seem much more real. You download a few gigabytes of data and start
working on it. It's quite easy to get confused.

I have real data! This is tangible stuff. Thinking of things properly and
evaluating the probability of your data given the hypothesis is _hard_. Your
data seems much more concrete. These are real people answering real questions
about X.

Even for people who are really hell-bent on statistical rigor, this is a
challenge.

------
amluto
If they held the same meeting 20 times, would they reach the same conclusion
in 19 of those meetings?

On a more serious note, I think that the use of the word "significant" to mean
"the effect is reasonably likely to exist by some standard" should be
abolished.

Webster's 1913 dictionary says:

> Deserving to be considered; important; momentous; as, a significant event.

Statisticians don't use "significant" to mean important at all -- they use it
to mean "I could detect it". This is bad when someone publishes a paper saying
"I found that some drug significantly reduces such-and-such" \-- this could
just mean that they did a HUGE study and found that the drug reliably had some
completely unimportant effect. It's much worse when it's negated, though.
Think about all the headlines that say that some treatment "did not have a
significant effect". This is basically meaningless. I could do a study finding
that exercise has no significant effect on fitness, for example, by making the
study small enough.

A good friend of mine suggested that statisticians replace "significant" with
"discernible". So next time someone does a small study, they might find that
"eating fat had no discernible effect on weight gain", and perhaps readers
would then ask the obvious question, which is "how hard did you look?".

This would also help people doing very good research make less wishy-washy
conclusions. For example, suppose that "vaccines have no discernible effect on
autism rates". This is probably true in a number of studies, but _it 's the
wrong analysis_. If researchers who did these studies had to state the
conclusions in a silly manner like that, maybe they'd find a more useful
analysis to do.

Hint: doing big studies just so you can fail to find an effect is nonsensical.
Instead, do big studies so you can put a tight upper bound on the effect.
Don't tell me that vaccines don't have a significant (or discernible) effect
on autism. Tell me that, with 99.9% confidence, you have ruled out the
possibility that vaccines have caused more than ten autism cases in the entire
history of vaccines, and that, most likely, they've caused no cases whatsoever
(or whatever the right numbers are).
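For the "zero observed cases" situation there is even a classic shortcut, the
rule of three: if you observe 0 events in n independent trials, an approximate
95% upper confidence bound on the event rate is 3/n. A one-line sketch:

    def rule_of_three_upper_bound(n):
        """~95% upper bound on an event rate when 0 events were seen in n trials."""
        return 3.0 / n

    rule_of_three_upper_bound(1_000_000)  # 3e-06: the rate is bounded at ~3 per million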

Edit: fixed an insignificant typo.

~~~
chc
That's the _second_ definition in Webster's 1913 edition. The first is:

> Fitted or designed to signify or make known something; having a meaning;
> standing as a sign or token; expressive or suggestive; as, a significant
> word or sound; a significant look.

It seems to me that this is the sense in which statisticians talk about
significance. It means that the results actually signify something rather than
just being meaningless noise.

~~~
amluto
Interesting. That definition seems like a bit of a stretch in this context to
me. Results of trials aren't "fitted or designed" to signify -- they are or
are not significant by the p-value standard, and whether they are or are not
is random (which is the whole point).

In any event, I suspect that, among most currently living English speakers,
the second definition is what comes to mind.

------
eanzenberg
5% (1 in 20) is a pretty weak threshold to pass. Let's go 5 sigma (p < 3e-7)
for discoveries and reserve 3e-7 < p < 0.05 for stuff we should take closer
looks at.

~~~
carbocation
> 5% (1 in 20) is a pretty weak threshold to pass. Let's go 5 sigma (p < 3e-7)
> for discoveries and reserve 3e-7 < p < 0.05 for stuff we should take closer
> looks at.

This would still end up leading to misuse of P-values. Let's say you're doing
a genome-wide association study on several hundred thousand SNPs. The
traditional threshold is 5e-8 (0.05 / 1,000,000 effective tests). So using
3e-7 for the threshold for "discovery", you'd count many things as discovery
that shouldn't be so.

On the other hand, let's say you do a study with 20 people with cancer. You
give 10 of them a drug, the other 10 a placebo. All 10 with the drug survive;
all 10 with the placebo die. Your P value is 0.0002. This doesn't count as
discovery, but clinically I know what my judgment is going to be.

This is all to say that the misuse of P-values does not just come from the
threshold.

~~~
eanzenberg
Multiple comparisons? 1 million independent tests? Hello?

>Your P value is 0.0002. This doesn't count as discovery, but clinically I
know what my judgment is going to be.

>Let's go 5 sigma (p < 3e-7) for discoveries and reserve 3e-7 < p < 0.05 for
stuff we should take closer looks at.

^Means you should start a new trial with n > 20 given the same placebo/drug
split.

~~~
minimaxir
In the scientific world, it is economically infeasible (cost/time) to run 1
million tests for a given hypothesis. Even n > 20 can be difficult for certain
studies. Bootstrapping the results to simulate 1 million trials won't fix the
aforementioned issue either.

~~~
_snydly
> Even n > 20 can be difficult for certain studies.

For some expensive experiments even n=3 is difficult to reach. "Should we do
another replicate and go from n=2 to n=3, or should we hire another post-doc?"
is a relatively common (rhetorical) question.

