
How Naive AB Testing Goes Wrong and How to Fix It - ewulczyn
http://ewulczyn.github.io/How_Naive_AB_Testing_Goes_Wrong/
======
yummyfajitas
By the way, as part of a research project which will (hopefully) become public
soon, I've been thoroughly investigating these issues. It's important to note
that Bayesian and Sequential testing are orthogonal.

You can have a sequential frequentist test (typically the Sequential
Probability Ratio Test), which is standard frequentist hypothesis testing that
allows multiple looks.
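
A minimal sketch of what such a sequential frequentist test looks like, using Wald's Sequential Probability Ratio Test for Bernoulli data (the thresholds use Wald's standard approximations; the function name and the choice of `p0`/`p1`/error rates are illustrative, not from the article):

```python
import math

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.05):
    """SPRT for H0: p = p0 vs H1: p = p1 on a stream of 0/1 outcomes.

    Accumulates the log-likelihood ratio after every observation and
    stops as soon as it crosses a decision boundary -- this is what
    makes "multiple looks" legitimate here.
    """
    upper = math.log((1 - beta) / alpha)   # cross this -> accept H1
    lower = math.log(beta / (1 - alpha))   # cross this -> accept H0
    llr = 0.0
    for x in observations:                 # x is 1 (conversion) or 0
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1"
        if llr <= lower:
            return "accept_h0"
    return "continue"                      # no boundary crossed yet
```

Unlike a fixed-horizon test, you may check the boundary after every single observation without inflating the error rates.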

Similarly, you can use a Bayesian test in a "deadbeat" (run the test out to
N=10,000, stop and choose the best) manner - you simply use the posterior to
determine which variant is superior (i.e. choose the one with the highest
mean).
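
The "deadbeat" Bayesian decision described above is very little code. A sketch assuming uniform Beta(1, 1) priors on each conversion rate (the function name and prior choice are mine, not the article's):

```python
def deadbeat_bayes(conv_a, n_a, conv_b, n_b):
    """Fixed-horizon Bayesian decision: run both variants out to a
    predetermined N, then pick the variant with the higher posterior
    mean conversion rate.

    With a Beta(1, 1) prior, the posterior after conv successes in
    n trials is Beta(conv + 1, n - conv + 1), whose mean is
    (conv + 1) / (n + 2).
    """
    mean_a = (conv_a + 1) / (n_a + 2)
    mean_b = (conv_b + 1) / (n_b + 2)
    return "A" if mean_a >= mean_b else "B"
```

The stopping rule is purely "N reached"; only the final choice is Bayesian.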

tl;dr: The core difference between deadbeat and sequential is whether or not
you stop the test at a fixed time. The core difference between Bayesian and
Frequentist (as described in this article) is whether you maximize a utility
function (e.g. revenue) or attempt to determine "truth".

------
ewulczyn
This post describes the shortcomings of hypothesis testing for web
optimization and proposes a method of repeated Bayesian A/B testing to address
them.

------
lifeisstillgood
I only understood one word in three here, and that worries me. I am envisaging
a world of "software literacy" where the ability to write decent, correct code
is as common as writing human language became (sometime between the
Renaissance and now).

I would say that for the next hundred years software is likely to be a 10x
skill - economically, intellectually and culturally valuable.

But there will be other skills dragged up with this - one likely one is the
statistical rigour to be able to make use of your new found software skills.

If "stats" is a necessary adjunct to software skills what other skills will be
part of the curriculum for the next century?

~~~
patio11
Most A/B testing uses statistical tests that are readily understood by
mathematically inclined high schoolers, and some of the most effective stats-
based decision-making processes in the world are designed by arbitrarily smart
people and executed by average high school graduates.

For example, "Here's a two-dimensional graph where the Y axis is a measured
feature of the part that just rolled off the assembly line and the X axis is
the trial number. Every five minutes, pull a part off the line and put a tick
mark at the appropriate place. See this red line on the graph? Two rules: a)
If you ever make three ticks above it, in succession, SHUT DOWN THE LINE. b)
If you ever make eight ticks above it in a shift, SHUT DOWN THE LINE. We will
never under any circumstances consequence an employee for hitting the Big Red
Button as a result of these two rules, but we will be most upset if the Big
Red Button is not pushed when these two rules indicate it should be pushed."

A particular company teaches this to, quite literally, even the people who
sweep the floors in its factories.
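
The two shop-floor rules above are mechanical enough to express in a few lines. A sketch (the function name, threshold semantics, and return strings are my own illustration of the rules as stated):

```python
def check_control_rules(ticks, red_line):
    """Apply the two stopping rules from the comment above to a shift's
    worth of measurements:
      a) three consecutive measurements above the red line, or
      b) eight total measurements above it in the shift,
    mean the line must be shut down."""
    consecutive = 0
    total_above = 0
    for value in ticks:
        if value > red_line:
            consecutive += 1
            total_above += 1
        else:
            consecutive = 0          # rule (a) requires a streak
        if consecutive >= 3 or total_above >= 8:
            return "SHUT DOWN THE LINE"
    return "keep running"
```

These are essentially control-chart run rules: each individual tick above the line is unremarkable, but the specified patterns are unlikely enough under a stable process to justify stopping.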

------
mszyndel
Could someone please explain this in short, plain English for the non-
statisticians in here?

~~~
yummyfajitas
Key points:

1) The goal is not to figure out if Variation is better than Control (this is
what classical hypothesis testing is about). The goal is to make the revenue
maximizing decision. If Variation and Control are identical, there is no harm
in a false positive.

2) The author finds it desirable to allow the test to run for a variable
amount of time, and stop it when statistical significance is achieved. This is
called _sequential testing_.

The rest is the details on how to actually accomplish these goals.
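
One common way to combine the two points is to track the posterior probability that Variation beats Control and stop once it is high (or low) enough. A Monte Carlo sketch, assuming independent Beta(1, 1) priors (the function name, prior, and draw count are illustrative, not from the article):

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Estimate P(rate_B > rate_A | data) by sampling both Beta
    posteriors and counting how often B's draw wins.

    A sequential test can call this after each batch of traffic and
    stop when the probability crosses a chosen threshold
    (e.g. > 0.95 -> ship B, < 0.05 -> ship A).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        if b > a:
            wins += 1
    return wins / draws
```

Note this addresses point 1 as well: when the variants are truly identical, "false positives" just pick one of two equally good options, so the expected revenue cost of stopping early is small.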

