
A/B Testing, from scratch - mottalrd
http://www.alfredo.motta.name/ab-testing-from-scratch/
======
btilly
A lot of math detail, but a naive reader will still make the mistake outlined
in [http://www.evanmiller.org/how-not-to-run-an-ab-test.html](http://www.evanmiller.org/how-not-to-run-an-ab-test.html).
And a less naive reader will still be left with no guidance on how to make
real-world decisions if they weren't lucky enough to get a strong statistical
result.

Here is a much, much simpler strategy that I can suggest for business users.

1\. Figure out the longest that you can afford to run your test for.

2\. Figure out the number of conversions N you expect in that time.

3\. Start running your test, randomly splitting visitors into A and B.

4\. Stop the test as soon as one version has sqrt(N) more conversions than the
other. Otherwise wait until you reach N total conversions between them, and go
with whichever is ahead.
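
The four steps above can be sketched as a small simulation. This is only a sketch under stated assumptions: the function name `run_test`, the per-visitor rates `rate_a` and `rate_b`, and the tie-break in favor of A are all made up for illustration, and in a real test the rates are of course unknown.

```python
import math
import random


def run_test(rate_a, rate_b, n_expected, rng):
    """Simulate one test under the stopping rule and return 'A' or 'B'.

    rate_a and rate_b are hypothetical per-visitor conversion probabilities.
    At least one of them must be positive, or the loop never ends.
    """
    threshold = math.sqrt(n_expected)
    conv_a = conv_b = 0
    while conv_a + conv_b < n_expected:
        # Step 3: randomly split visitors between A and B.
        if rng.random() < 0.5:
            conv_a += rng.random() < rate_a
        else:
            conv_b += rng.random() < rate_b
        # Step 4: stop as soon as one version leads by sqrt(N) conversions.
        if abs(conv_a - conv_b) >= threshold:
            break
    # Otherwise go with whichever is ahead (ties go to A, an arbitrary choice).
    return "A" if conv_a >= conv_b else "B"


def fraction_correct(rate_a, rate_b, n_expected, runs, seed=0):
    """Fraction of simulated tests that pick the truly better version."""
    rng = random.Random(seed)
    better = "A" if rate_a >= rate_b else "B"
    wins = sum(run_test(rate_a, rate_b, n_expected, rng) == better
               for _ in range(runs))
    return wins / runs
```

Calling `fraction_correct` with rates close together versus far apart gives a rough feel for how often the rule picks the right version at a given N.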

Here are some comments on the procedure.

Stopping before N/2 total conversions is roughly the same certainty as
stopping at 95% confidence. Stopping after that is an admission that you are
crossing your fingers and going with an educated guess, and it is not feasible
to collect enough data to get a better answer.

If one version has a conversion rate that is a factor of (1 + 2/sqrt(N)) better
than the other, you pretty reliably choose right. The flip side of this is that you'll
make a lot of mistakes if you get much below that threshold. If those
potential errors are too big, then A/B testing is not going to work well for
you because you don't have enough data for statistics.

~~~
danieltillett
btilly where does this approach come from?

~~~
btilly
My scratchbook. :-)

I keep meaning to write it up.

Mathematically it is similar to
[http://www.evanmiller.org/sequential-ab-testing.html](http://www.evanmiller.org/sequential-ab-testing.html)
but with the difference that Evan produced a stopping rule that is a 95%
confidence interval, while mine stops at 95% confidence that you won't come to
a different conclusion by the end of the test. Otherwise the idea of the
analysis is the same.

(Note, under the null hypothesis reaching sqrt(N) then crossing 0 is exactly
as hard as reaching 2 sqrt(N), as can be seen by taking any possible sequence
that does one, and swapping the values after first reaching sqrt(N) to get a
sequence that does the other.)
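
The swapping argument in the note can be checked by brute force on short sequences. The sketch below assumes the difference in conversions is modeled as a walk of +1/-1 steps (each conversion equally likely to land on either side under the null hypothesis), and it treats "crossing 0" as "coming back to 0". The function name and the small level `k` standing in for sqrt(N) are illustrative only.

```python
from itertools import product


def count_paths(n_steps, k):
    """Enumerate all +1/-1 paths of length n_steps starting at 0.

    Returns (paths that reach k and later come back to 0,
             paths that reach 2k).
    """
    hit_k_then_zero = hit_2k = 0
    for steps in product((1, -1), repeat=n_steps):
        pos = 0
        reached_k = returned = reached_2k = False
        for step in steps:
            pos += step
            if pos == k:
                reached_k = True
            if reached_k and pos == 0:
                returned = True
            if pos == 2 * k:
                reached_2k = True
        hit_k_then_zero += returned
        hit_2k += reached_2k
    return hit_k_then_zero, hit_2k
```

For every path length and level that is small enough to enumerate, the two counts should come out equal, which is what the swap (a reflection after first reaching the level) predicts.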

~~~
danieltillett
Please do write it up as I would certainly be interested in reading it.

------
RyJones
Glad to read this. So much so-called A/B testing done is of the naive variety
he outlines.

Much more here:
[http://www.exp-platform.com/Pages/default.aspx](http://www.exp-platform.com/Pages/default.aspx)

disclosure: I worked with ExP a while ago.

~~~
mwexler
Second the above. The team at MS have done incredible work around the
pragmatics of being able to trust the findings of the online experiment and
possible areas of confounds. Anything Kohavi touches is worth paying attention
to.

------
nonbel
>"In order to estimate what is the true mean of our variants statisticians
rely on the Central Limit Theorem (CLT) which states that the sampling
distribution of any statistic will be normal or nearly normal, if the sample
size is large enough."

Wow, no. This is a very dangerous misconception. First sentence of the
wikipedia page gives a much better definition:

"In probability theory, the central limit theorem (CLT) states that, given
certain conditions, the arithmetic mean of a sufficiently large number of
iterates of independent random variables, each with a well-defined expected
value and well-defined variance, will be approximately normally distributed,
regardless of the underlying distribution."

[https://en.wikipedia.org/wiki/Central_limit_theorem](https://en.wikipedia.org/wiki/Central_limit_theorem)
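
To make the distinction concrete, here is a rough simulation sketch comparing two statistics computed from the same skewed samples: the mean, which the CLT covers, and the maximum, which it does not. The exponential distribution, the sample sizes, and the use of skewness as a crude "how normal does this look" measure are all assumptions chosen for illustration.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: close to 0 for a symmetric, normal-looking distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def simulate(n_samples=200, n_repeats=2000, seed=0):
    """Return (skewness of sample means, skewness of sample maxima)."""
    rng = random.Random(seed)
    means, maxima = [], []
    for _ in range(n_repeats):
        # Exponential draws: a strongly skewed underlying distribution.
        sample = [rng.expovariate(1.0) for _ in range(n_samples)]
        means.append(statistics.fmean(sample))
        maxima.append(max(sample))
    return skewness(means), skewness(maxima)
```

The sample means should show skewness close to zero, while the sample maxima stay clearly skewed no matter how large the sample gets, so "any statistic" is too strong.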

------
hornbaker
Good read, but pretty dense (or perhaps I'm just dense).

Personally, I get more practical value from
[http://elem.com/~btilly/effective-ab-testing/](http://elem.com/~btilly/effective-ab-testing/).

------
franciscop
There's possibly a big language barrier here around "urn". I guessed it was an
acronym, and wikipedia[1] seemed like no help until I realized that "urn" is
for an actual vessel-like object (called an _urn_ ) so I could find the
correct article[2].

Which turns out to be such a simple concept, but maybe a note could help
future international readers.

[1]
[https://en.wikipedia.org/wiki/Urn_%28disambiguation%29](https://en.wikipedia.org/wiki/Urn_%28disambiguation%29)
[2]
[https://en.wikipedia.org/wiki/Urn_problem](https://en.wikipedia.org/wiki/Urn_problem)

~~~
sdrothrock
> I guessed it was an acronym

If it were an acronym, it would likely be capitalized as URN, similar to ASAP
or GNU.

> until I realized that "urn" is for an actual vessel-like object

A dictionary search seems like it would have helped here instead of Wikipedia.

> Which turns out to be such a simple concept, but maybe a note could help
> future international readers.

I'm not sure how native speakers in any language could determine whether a
given word would be problematic for international readers. There's no inherent
ambiguity in the word "urn" in this case for native speakers.

Do you have any ideas, or are you recommending a change for this particular
article?

Personally, I'd recommend that the author change "urn" to something more
common, like "jar" or "bottle." Since the type of the container itself isn't
even germane to the discussion, "box" or "container" might be clearer still.

~~~
franciscop
I've met many people who were learning Spanish and now I get a "feeling" when
something could be confusing, but I'm not sure of any general advice for
determining it otherwise. It's even more difficult when it's a particular
field like statistics. In retrospect, this is something I could have guessed
normally. Just some ideas to answer in a more "general" way from my point of
view:

\- "Statisticians love urns and, guess what, our problem can be modeled as an
extraction from two different urns."

Something that might help is noting that it's a typical problem like this:

\+ "Statisticians love urns and, guess what, our problem can be modeled as a
typical extraction from two different urns."

Note the addition of "typical". Just wording like that would give me a hint
that it's something that I'm unaware of. It's just like talking about any
other field. Compare these:

\- The two prisoners is a case widely known for its... (what two prisoners?
were they in the news? it could be _anything_ and seems non-googleable)

\+ The typical two prisoners problem is widely known for its... (oh, it's a
reference to a specific, famous problem, google gives a quick match)

------
duhruh
A better alternative to your simple A/B test would be the multi-armed bandit
algorithm. This will give you your answer the fastest, though it's slightly
harder to wrap your head around. Check it out here:
[http://stevehanov.ca/blog/index.php?id=132](http://stevehanov.ca/blog/index.php?id=132)
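
For a rough idea of what that looks like in code, here is a minimal sketch of epsilon-greedy, one simple bandit strategy. The class name, the 10% exploration rate, and the optimistic default for untried versions are assumptions for illustration, not a reference implementation.

```python
import random


class EpsilonGreedy:
    """Show the best-looking version most of the time, explore the rest."""

    def __init__(self, arms, epsilon=0.1, rng=None):
        self.epsilon = epsilon  # fraction of visitors used for exploration
        self.rng = rng or random.Random()
        self.trials = {arm: 0 for arm in arms}
        self.rewards = {arm: 0 for arm in arms}

    def rate(self, arm):
        # Optimistic default so every version gets tried at least once.
        if self.trials[arm] == 0:
            return 1.0
        return self.rewards[arm] / self.trials[arm]

    def choose(self):
        # Explore: pick a random version with probability epsilon.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.trials))
        # Exploit: pick the version with the best observed conversion rate.
        return max(self.trials, key=self.rate)

    def record(self, arm, converted):
        self.trials[arm] += 1
        self.rewards[arm] += int(converted)
```

For each visitor you would call `choose()`, show that version, and then call `record()` with whether the visitor converted. The `epsilon` parameter is the main knob: higher means more exploration and slower convergence on the winner.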

~~~
codyb
I liked the article because it's really fun to stick with something like that,
wrap my head around the math, and enjoy the exploration, but the multi-armed
bandit algorithm was supremely satisfying to read. Simple, intuitive, logical,
and with easy-to-tweak parameters. It doesn't get much better than that.

------
jeremywho
This article feels similar to "how to draw an owl".

