

A/B testing pitfalls and lessons learned - samps
http://exp-platform.com/Documents/IEEEComputer2007OnlineExperiments.pdf

======
btilly
The advice is mixed in quality.

The worst piece of advice is to use only one metric, some complicated mix of
other metrics. The basic reason they want this is to give a clear go/no-go
signal that everyone agrees on. Perhaps if you have to deal with the politics
of a larger organization, that's a good idea. But if you're a small company,
the extra detail you get about how your product is used from tracking multiple
metrics is very valuable for clarifying what you're trying to do, and how you
want to do it.

Furthermore, the act of creating a complex weighted measure just pushes the
argument elsewhere. And when you're still trying to figure out how your site
is actually performing, you don't have the context to know which measure to
use. You also won't be able to use the obvious chi-square test (or its better
relative, the G-test). There is no need to over-complicate the statistics.
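To illustrate how simple the per-metric statistics can stay, here is a sketch of the G-test on a 2x2 table of conversions versus non-conversions (the counts and the 3.84 cutoff for one degree of freedom at p=0.05 are just for illustration, not from the article):

```python
import math

def g_test_2x2(conv_a, total_a, conv_b, total_b):
    """G-test of independence on a 2x2 conversions table.

    Returns the G statistic; compare it against the chi-square
    distribution with 1 degree of freedom (3.84 is roughly p = 0.05).
    """
    observed = [
        [conv_a, total_a - conv_a],
        [conv_b, total_b - conv_b],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    g = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            if observed[i][j] > 0:
                g += 2 * observed[i][j] * math.log(observed[i][j] / expected)
    return g
```

With, say, 30/1000 conversions on A versus 50/1000 on B, G comes out just above 5, clearing the 3.84 bar. The point is that one plain test per metric is all the machinery you need.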

The idea of having a hashing function to do test assignment is one that I had
not considered. I've always suggested the obvious rand() at assignment time
approach, which accomplishes the same thing but with more overhead at run
time. I'd caution people who try the hashing approach to use a standard
library, because it would be really, really easy to have the website think
that assignment is done one way while your analysis assumes that it is done in
another.

The minimum duration point is interesting... and somewhat useless. When I was
preparing my presentation a few years ago I found out that, even if you know
exactly how much better A is than B, you can't predict to within an order of
magnitude how quickly your experiment will show it. My attitude is the much
simpler, "The test takes however long it will take, and you can't really know
how long that will be in advance." After you've done a few tests, people will
have a good enough idea for a back-of-the-envelope estimate.

The other advice seemed good, and mostly was obvious to me. But I have more
experience with A/B testing than most do.

~~~
patio11
Hashing has one major advantage over random assignment: reproducibility. If
you have some way of IDing Mary Smith, you can consistently serve her the same
tests on all machines, and expose the tests she is participating in to bug
investigators, without having to actually store a users => chosen test
alternatives map anywhere. Those can get fairly sizable and the access
patterns suck.

A/Bingo does it by taking MD5(user_id . test_name). You have to store
user_id, but you're doing that anyhow.
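A Python sketch of the same idea (A/Bingo itself is Ruby; the function name and inputs here are my own): hash the user id concatenated with the test name, then reduce the digest modulo the number of alternatives. Same inputs always give the same answer, on any machine, with nothing stored.

```python
import hashlib

def assign(user_id, test_name, alternatives):
    """Deterministically assign a user to a test alternative,
    in the spirit of A/Bingo's MD5(user_id . test_name) trick."""
    digest = hashlib.md5((str(user_id) + test_name).encode("utf-8")).hexdigest()
    return alternatives[int(digest, 16) % len(digest and alternatives)]
```

Because MD5 output is well mixed, the split across alternatives comes out roughly even over many users, and any bug investigator can recompute Mary Smith's assignment from her id alone.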

~~~
btilly
If you do random assignment, and save the random assignments somewhere, then
you can also get reproducibility. If you do random assignment, and don't save
the random assignments anywhere, then you have no idea how many people were in
your test versions. Which is not a good idea. (You can estimate this data.
I've done it. But doing it properly is surprisingly tricky. It is very, very
easy to do it wrong.)

There are several benefits to this approach.

The first benefit is that if you're testing a particular page, you can easily
make your test only include people who have hit that page. This will cause
results to converge more quickly than if you don't know which people on your
site actually hit that page.

The second, and sometimes critical, benefit is that you know exactly _when_
someone entered the test. Multiple times when testing things with a longer
sales cycle I've encountered the situation where a particular test version
causes the sales cycle to become compressed, but may or may not provide a
long-term improvement in conversion. Access to data about when people entered
your test allows you to examine A/B test results only for people who have been
in the test long enough to be likely to have completed either version.
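A sketch of that filter, assuming hypothetical data shapes (a user -> (version, entry time) map and a set of converted user ids; none of this is from the thread, it's just one way to slice the stored assignments):

```python
from datetime import datetime, timedelta

def mature_results(assignments, conversions, now, min_days=30):
    """Tally conversions only for users who entered the test at least
    `min_days` ago, so both versions have had time to play out a
    longer sales cycle.

    assignments: dict of user_id -> (version, entered_at datetime)
    conversions: set of user_ids that converted
    Returns (totals per version, conversions per version).
    """
    cutoff = now - timedelta(days=min_days)
    totals, wins = {}, {}
    for user_id, (version, entered_at) in assignments.items():
        if entered_at > cutoff:
            continue  # entered too recently to judge fairly
        totals[version] = totals.get(version, 0) + 1
        if user_id in conversions:
            wins[version] = wins.get(version, 0) + 1
    return totals, wins
```

Without the entry timestamps that stored assignments give you, this cut is impossible, and a version that merely compresses the sales cycle can look like a permanent winner.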

A third benefit is that if you're testing multiple versions, then you can just
continue the test and drop poorly performing versions as you prove that they
are suboptimal.

The downside, as you say, is that the map of who is in what version can get
very large. In my experience, though, the access patterns are not that bad.
Particularly not if you are already using sessions, and can cache that
information in the current session, so that most page hits don't have to fetch
the A/B test version. Furthermore this is not data that you need to join to
anything else on your live website. Therefore it is a perfect candidate to
move somewhere like Redis.
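A sketch of that lookup path, with plain dicts standing in for both the session cache and the assignment store (in production the store would be something like a Redis hash behind a thin wrapper; the function and its signature are my own, not from any library):

```python
import random

def version_for(user_id, test_name, alternatives, store, session):
    """Look up (or create) a user's test assignment.

    session: per-visitor dict; caches the answer so most page hits
             never touch the store.
    store:   dict-like map of (test_name, user_id) -> version; this is
             the part you'd back with Redis, since it never needs to
             be joined to anything on the live site.
    """
    key = (test_name, user_id)
    if key in session:
        return session[key]
    version = store.get(key)
    if version is None:
        version = random.choice(alternatives)  # rand() at assignment time
        store[key] = version                   # ...but saved, so reproducible
    session[key] = version
    return version
```

The first hit per visitor pays for one store round trip; everything after that is served from the session, which is why the access patterns end up mild in practice.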

------
mwexler
The MS site (<http://exp-platform.com/>) has more of Kohavi's papers on how MS
uses experimentation across their systems. Worth a perusal if you want more
depth than this overview document.

