
There’s a 5% chance that these results are total bullshit - allforJesse
http://metricsparrow.com/theres-a-5-chance-that-these-results-are-total-bullshit/
======
jordigh
> Stop saying: “We’ve reached 95% statistical significance.”

> And start saying: “There’s a 5% chance that these results are total
> bullshit.”

Argh, no, no, no and no!

95% significance is NOT 95% probability! When you select a confidence level of
95%, the probability that your results are nonsense is ZERO or ONE. There is
no probability statement associated with it. Just because something is unknown
does not mean that you can make a probability statement about it, and the
mathematics around statistical testing all depend on the assumption that the
parameter being tested is not random, merely unknown...

Rather, 95% statistical significance means: we got this number from a
procedure that produces the right thing 95% of the time, but we have no idea
whether this particular number we got is correct or not.

UNLESS!

Unless you're doing Bayesian stats. But in that case your procedure looks
completely different and produces credible intervals instead of confidence
intervals, and you don't talk about statistical significance at all, but about
raw probabilities.
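
To make the "procedure" point concrete, here's a rough simulation sketch
(Python, made-up normal data): any single interval either contains the true
mean or it doesn't, but the procedure that produces the intervals covers it
about 95% of the time.

    # Sketch: "95%" describes the interval-producing procedure, not any one interval.
    # Illustrative only: normal data with a known true mean.
    import numpy as np

    rng = np.random.default_rng(0)
    true_mean, n, trials = 10.0, 50, 10_000

    covered = 0
    for _ in range(trials):
        sample = rng.normal(true_mean, 2.0, n)
        m = sample.mean()
        se = sample.std(ddof=1) / np.sqrt(n)
        lo, hi = m - 1.96 * se, m + 1.96 * se   # approximate 95% CI
        covered += lo <= true_mean <= hi        # for THIS interval: simply true or false

    print(covered / trials)  # ~0.95 across repetitions of the procedure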

~~~
hencq
I'm not really sure what you're trying to say.

> Rather, 95% statistical significance means, we got this number from a
> procedure that 95% of the time produces the right thing, but we have no idea
> whether this particular number we got is correct or not.

I.e., we got this number from a procedure, and there's a 5% chance it didn't
produce the right thing.

~~~
Fomite
Nope. It's "If we did this infinitely more times, there's a 5% of those
samples wouldn't have significant results". It's a subtle but important
distinction.

Though I'm surprised that his advice wasn't "at least report confidence
intervals". There's _much_ more meaningful information in a point estimate and
confidence interval than "p < 0.05".

~~~
rcthompson
Sorry, confidence intervals are just a different presentation of the same
information as p-values, and don't contain any more or less information.

~~~
Fomite
While built off the same information, and it's possible to do an ad hoc
significance test off it, confidence intervals tell you more about the spread
of the estimate. _Especially_ if, as the author is suggesting, you're not even
reporting the actual p-value, but just whether or not it's below a particular
threshold.
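
A quick illustration, as a sketch with made-up numbers: the same two samples
summarized as a bare "p < 0.05" versus as a point estimate with a confidence
interval.

    # Sketch: same data, two reports. Assumes roughly normal samples; numbers invented.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    a = rng.normal(0.10, 0.05, 200)   # metric for variant A
    b = rng.normal(0.12, 0.05, 200)   # metric for variant B

    t, p = stats.ttest_ind(b, a)
    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    lo, hi = diff - 1.96 * se, diff + 1.96 * se

    print("p < 0.05:", p < 0.05)      # "significant", but how big is the effect?
    print(f"estimated lift {diff:.4f}, 95% CI ({lo:.4f}, {hi:.4f})")  # magnitude + spread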

------
scishop
No.

In frequentist thinking, p = 0.05 means that if there were in reality no
difference between your A and B and you repeated the experiment many times, 5%
of the observed differences would be equal to or greater than the difference
you just measured.

No probabilistic statement about the results being correct or incorrect can be
made from a null-hypothesis significance test.
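
A rough simulation of that definition (Python, made-up conversion rates): make
A and B truly identical, repeat the experiment many times, and count how often
chance alone produces a difference at least as large as the one "measured".

    # Sketch: the null world. Both arms share the same true rate; the p-value is
    # (approximately) the fraction of chance differences >= the observed one.
    import numpy as np

    rng = np.random.default_rng(2)
    n, reps = 1000, 20_000
    observed_diff = 0.03                     # pretend the A/B test measured this

    diffs = np.empty(reps)
    for i in range(reps):
        a = rng.binomial(1, 0.10, n).mean()  # identical true rates: the null
        b = rng.binomial(1, 0.10, n).mean()
        diffs[i] = abs(b - a)

    print((diffs >= observed_diff).mean())   # ~ the two-sided p-value for 0.03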

~~~
allforJesse
(apologies for repeating from above)

If you have a few minutes to spare, I would very much welcome your thoughts so
that I can either correct the article, or take it down - The last thing I want
is for it to sit out there on the open internet as misinformation.

My goal was to create a framework which — while less mathematically accurate
(hence “rhetorical device”) — helped convey the seriousness of making business
decisions based on P = 0.05 to people for whom 95% statistical significance
doesn’t mean anything. And clearly, based on reactions here, I failed at that
goal.

So, if you’re game, I’ll quickly to walk you through my thinking, and you can
help me understand where I went wrong. Best way to contact?

------
CountBayesie
I've long argued that the biggest problem with orthodox NHST for A/B testing
is that you actually don't care about 'significance of effect' as much as you
do 'magnitude of effect'. Furthermore, p-values tell you nothing about the
range of possible improvements (or lack thereof) you're facing. Maybe you are
willing to risk potential losses for potentially huge gains, or maybe you
can't afford to lose a single customer and would rather exchange time for
certainty.

My favored approach is the one I've outlined here[0], where the problem is
treated as one of Bayesian parameter estimation. Benefits include:

1. Output is a range of possible improvements, so you can reason about
risk/reward for calling a test early.

2. Allows the use of prior information to prevent very early stopping, and
provides better estimates early on.

3. Every piece of the testing setup is, imho, easy to understand (ignore this
benefit if you can comfortably derive Student's t-distribution from first
principles).

[0] https://www.countbayesie.com/blog/2015/4/25/bayesian-ab-testing
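
As a minimal sketch of the flavor of this approach (my simplifications, not
the full method in the post: binary conversions, flat Beta(1, 1) priors,
made-up counts):

    # Sketch: Bayesian A/B as parameter estimation. Output is a range of lifts
    # and a probability, not a significance verdict.
    import numpy as np

    rng = np.random.default_rng(3)
    conv_a, n_a = 120, 1000   # hypothetical A results
    conv_b, n_b = 140, 1000   # hypothetical B results

    # Beta(1, 1) prior + binomial data -> Beta posterior
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, 100_000)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, 100_000)

    lift = (post_b - post_a) / post_a
    print(f"P(B beats A) = {(post_b > post_a).mean():.3f}")
    print("95% credible interval for relative lift:",
          np.percentile(lift, [2.5, 97.5]).round(3))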

------
JDDunn9
Lots of nit-picking here. In plain English, confidence intervals are about the
chance that your results are bogus. You flipped 100 coins, all of them came up
heads, and you concluded that 100% of coin tosses come up heads. By chance,
you got a very unlikely sample that differed substantially from the
population. You could also conclude your A/B test is a success when it was
just randomly atypical.

------
thanatropism
You have to wonder: what else from his junior year in college did Mr.
Avshalomov get completely wrong?

How many of the recent YC graduates fail at basic numeracy? Does node.js mean
you don't have to understand data structures and algorithms to successfully
"preneur" too?

I mean, in finance this doesn't fly. Or in consulting. So there's adverse
selection to worry about too.

------
sbov
I'm not a statistician, but lately I've been wondering:

When we're A/B testing code, the code is already written. If there's a 5%, or
even 15%, chance of it being bullshit, who cares? The effort is usually
exactly the same whether I switch or not.

It's my understanding that 95%, 99%, etc, were established for things that
require extra change. We don't want to spend extra time developing and
marketing a new drug if it isn't effective. We don't want to tell people to do
A instead of B if we aren't sure A is really better than B.

But in software I've already spent all the time I need to implement the
variation on the feature. So given that, why do I need 95%?

I would appreciate if someone with more knowledge can answer this question.

Edit to add: I see a lot of answers about the cost to keep the code around.
What about A/B tests that don't require extra code, just different code? Most
of our A/B tests fall into this category.

~~~
jy133
Would you push a feature that negatively affected your product? With 95%
confidence you will be able to know whether your feature is indeed positive,
negative, or roughly neutral.

~~~
andreasklinger
I think the core question is:

validation of upside vs validation of downside

as in: i want to avoid pushing something that is worse, but i am optimistic
(or even indifferent) about how much better something is

personal opinion: data trains gut-feeling

------
mathattack
It gets even worse.

If you try 100 tests and pick the 5 that pass the statistical-significance
threshold, most likely all 5 are BS.
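
A quick simulation sketch (Python, made-up data, all 100 tests run on truly
identical variants):

    # Sketch: 100 A/B tests where nothing actually works. Every test that
    # crosses p < 0.05 is, by construction, a false positive.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    winners = 0
    for _ in range(100):
        a = rng.binomial(1, 0.10, 2000)   # identical true conversion rates
        b = rng.binomial(1, 0.10, 2000)
        _, p = stats.ttest_ind(a, b)
        winners += p < 0.05

    print(winners)  # ~5 "winners", and all of them are BS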

------
glaberficken
Is it just me, or does this sentence make no mathematical sense at all?

"If you’re running squeaky clean A/B tests at 95% statistical significance and
you run 20 tests this year, odds are one of the results you report (and act
on) is going to be straight up wrong."

------
Fomite
"We’re taking techniques that were designed for static sample sizes and
applying them to continuous datasets" \- Wait, seriously? Do A/B testers not
use the _very_ well developed techniques that exist for time-series data?
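
For anyone curious why this matters, here's a rough sketch (Python, made-up
data) of applying a fixed-sample test to streaming data and stopping at the
first p < 0.05: the false-positive rate balloons well past 5%.

    # Sketch: "peeking" at a fixed-sample t-test after every batch. Both arms are
    # identical, yet stopping at the first p < 0.05 fires far more than 5% of the time.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    trials, batches, batch_size = 2000, 20, 100
    false_pos = 0
    for _ in range(trials):
        a, b = np.empty(0), np.empty(0)
        for _ in range(batches):
            a = np.append(a, rng.normal(0, 1, batch_size))  # identical arms
            b = np.append(b, rng.normal(0, 1, batch_size))
            if stats.ttest_ind(a, b).pvalue < 0.05:         # peek and stop
                false_pos += 1
                break

    print(false_pos / trials)  # well above 0.05 with 20 peeks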

------
Grue3
It just means A is slightly worse than B. Or equal to B. Or much worse than B,
but that is quite unlikely (way less than 5%).

