
Lessons learned A/B testing with GAE/Bingo - llambda
http://bjk5.com/post/28269263789/lessons-learned-a-b-testing-with-gae-bingo
======
btilly
I really like the idea of having many different metrics that you automatically
track for every test you run. I've been telling people to do that for years,
and saying that the fact that standard A/B test frameworks don't is good
enough reason to roll your own.

However, suppose that you have 20 metrics you are following, and you run 20
tests. That's 400 metric/test combinations, and at a 99% confidence threshold
each one has a 1% chance of a spurious result, so you should expect around 4
of them to show 99% confidence purely by chance. This is just a side-effect of
having many tests and many metrics.

Therefore, if you find yourself in that situation, you should be strongly
predisposed to assume that unexpected results on metrics that seem unconnected
to your test really are due to random chance. The odds of weird chance results
are higher than you would have guessed.
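
To make that arithmetic concrete, here is a quick simulation (a Python
sketch; the conversion rate and sample size are made up, and metrics are
treated as independent for simplicity) of 20 tests tracking 20 metrics each,
with no real effect anywhere:

    import random

    N = 1000    # users per arm (made up)
    P = 0.10    # true conversion rate, identical in both arms (made up)

    def z_score(a_hits, b_hits, n):
        # Two-proportion z-score for per-person conversion counts.
        pooled = (a_hits + b_hits) / (2 * n)
        se = (2 * pooled * (1 - pooled) / n) ** 0.5
        return abs(a_hits - b_hits) / (n * se) if se else 0.0

    false_positives = 0
    for test in range(20):
        for metric in range(20):
            a = sum(random.random() < P for _ in range(N))
            b = sum(random.random() < P for _ in range(N))
            if z_score(a, b, N) > 2.576:    # ~99% two-sided threshold
                false_positives += 1
    print(false_positives)    # about 4 on average: 400 combinations * 1%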

~~~
kamens
Amen to that. This is an issue we've been very aware of recently, and we're
discussing various possibilities for mitigating it.

The point about using historical graphs instead of single significance numbers
reduces, but does not eliminate, the likelihood of making a mistake due to one
of these chance results.

~~~
btilly
Historical graphs do not actually help you at all. The conditional probability
of seeing any particular historical graph is fixed once you state the current
number of observations and significance number. In particular, the historical
graph gives you no information about the true underlying probability that you
did not already have from the number of observations and the current
significance number.

~~~
kamens
Say your significance number currently claims 99% significance in favor of
alternative A.

A historical graph can show you that 12 hours ago, there was a statistically
significant result in favor of alternative B (we've seen this plenty of times).

Some metrics take a while to stabilize, and the more metrics you have, the
more likely you are to run into these particular situations (as you mention
above). A graph helps you understand recent variability in the metric.

~~~
btilly
Recent variability _shouldn't_ matter. Assuming you don't have some external
factor driving variability (like an email program), the only statistically
important fact that past variability tells you is an indirect estimate of how
many people you have in the test. But _that_ is a number you already have
direct access to.

If this does not seem true for you, then you need to review how you are doing
your stats, because something sounds fishy. Perhaps, for example, you are
plugging in the number of times the target event happened instead of the
number of people that the target event happened to? If so, your observations
are correlated, which throws off your statistical tests. There are a lot of
ways to do the stats wrong, and I like to comment that if you do, then you'll
come up with wrong answers - and believe them.
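
For example (a hypothetical sketch; the log format and names are made up),
counting events instead of people triple-counts a user who fires the event
three times:

    # One user firing the event three times must count as ONE converted
    # person, not three successes, or the observations are correlated
    # and the significance test is invalid.
    events = [("alice", "click"), ("alice", "click"),
              ("alice", "click"), ("bob", "click"),
              ("carol", "view")]

    # Wrong: 4 successes out of 5 "observations" (alice counted thrice).
    event_successes = sum(1 for _, e in events if e == "click")

    # Right: 2 converted people out of 3 participants.
    converted = len({u for u, e in events if e == "click"})
    participants = len({u for u, _ in events})
    print(event_successes, converted, participants)    # 4 2 3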

~~~
kamens
Agreed. You're absolutely right that it's an indirect estimate of how many
people you have in the test and doesn't add any extra information.

The only thing I'm suggesting is that looking at a graph that is fluctuating
wildly up and down can be more helpful and more easily accessible than asking
someone who isn't an expert in stats (I raise my hand) to look at a number of
participants and immediately understand its effect on this specific
experiment's variability.

No, we are tracking the number of people that the target event happened to --
but we also track the number of times it happens, and as mentioned in the
article, we are aware that these specific metrics are extremely outlier-prone
and are looking into ways of improving this. All advice is welcome; your tips
are much appreciated.

~~~
btilly
If you're using a chi-square or its better relative, the g-test, then outliers
do not matter.
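
Both tests run on the same 2x2 table of per-person conversion counts. A
minimal sketch using scipy (the counts are made up); the g-test is just
chi2_contingency with lambda_="log-likelihood":

    from scipy.stats import chi2_contingency

    # Rows: alternatives A and B; columns: converted / not converted.
    table = [[120, 880],    # A: 120 of 1000 people converted
             [150, 850]]    # B: 150 of 1000 people converted

    # Pearson chi-square test.
    chi2, p_chi2, _, _ = chi2_contingency(table, correction=False)

    # G-test (log-likelihood ratio).
    g, p_g, _, _ = chi2_contingency(table, correction=False,
                                    lambda_="log-likelihood")
    print(p_chi2, p_g)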

If you're using ANOVA or Student's t-test, don't do that. They assume that
you're sampling from normal distributions, and you are not.

What you should do is truncate extreme outliers, then use a z-test. Start at
http://elem.com/~btilly/effective-ab-testing/index.html#aslide94 and use the
arrow keys to go through about 2 sections for details on that.
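
A minimal sketch of that recipe (the truncation cap is an assumption you have
to pick per metric, and the function name is made up; the slides linked above
give the reasoning):

    import math

    def truncated_z(a_values, b_values, cap):
        # Truncate extreme outliers (e.g. one user spending $5000),
        # then compare means with a two-sample z-test.
        a = [min(x, cap) for x in a_values]
        b = [min(x, cap) for x in b_values]
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
        vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
        return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

    # e.g. revenue per user, capped at a made-up $100:
    # z = truncated_z(revenue_a, revenue_b, cap=100)
    # |z| > 2.576 corresponds to ~99% two-sided confidence.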

If you have any questions, shoot them to btilly@gmail.com.

~~~
kamens
This is awesome help, thank you very much.

~~~
btilly
You are welcome. Glad it helped.

------
xiaoma
I really like the suggestion to do A/A tests. A lot of the Seth Roberts-style
n=1 studies I've done on my nutrition, athletic performance, memory, etc. led
me to completely erroneous conclusions because I didn't set the significance
threshold high enough before acting on an experiment.
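
In an A/A test both arms get the identical experience, so any "winner" is by
definition a false positive. A quick sketch (rates and sample sizes are made
up) of the calibration a healthy stats pipeline should show:

    import random

    def aa_run(n=2000, p=0.10):
        # Both arms have the same true rate; return True when the run
        # falsely clears a ~99% two-sided significance threshold.
        a = sum(random.random() < p for _ in range(n))
        b = sum(random.random() < p for _ in range(n))
        pooled = (a + b) / (2 * n)
        se = (2 * pooled * (1 - pooled) / n) ** 0.5
        return se > 0 and abs(a - b) / (n * se) > 2.576

    flukes = sum(aa_run() for _ in range(1000))
    print(flukes / 1000)    # should hover near 0.01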

