
How to avoid the biggest A/B testing mistakes - jjanyan
http://muddylemon.com/2012/04/ab-multivariate-testing-for-landing-pages/
======
robrenaud
> “That’s so random.” is one of the more annoying clichés in recent
> circulation. In popular parlance the word “random” often means something
> between “unexpected” and “unusual.”

I think that's actually a good fit for the word "random". Consider building an
n-gram language model that tries to predict the next word in a sequence of
text given some small history. Better models make better predictions. The more
that word sequences seem unusual, unpredictable, or random to the model, the
worse the model is doing. Just like the "random" word is hard to predict given
the history, the "random" partitioning of users into experiments should be
hard to predict based on characteristics of the users.
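
For what it's worth, a common way to get that property is to key the
assignment off a salted hash of the user id rather than off anything
behavioral. A minimal sketch in Python (the experiment name and split are
made up):

    import hashlib

    def assign_bucket(user_id, experiment="landing_page_test", treatment_share=0.5):
        # Salt the hash with the experiment name so the same user always
        # gets the same bucket within an experiment, but assignments are
        # uncorrelated across experiments and with user characteristics.
        digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
        point = int(digest, 16) / 16 ** 32  # roughly uniform in [0, 1)
        return "treatment" if point < treatment_share else "control"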

------
mef
Thanks for this great article.

Both this article and Effective A/B Testing
(<http://elem.com/~btilly/effective-ab-testing/> slide 51) say changing the
proportions mid-test is bad. Can anyone elaborate on why this might be?

For example, if I'm randomly showing my existing landing page or a new landing
page to visitors at a 2/3-1/3 split and measuring CTR to the signup page, and
I change the proportion to 1/2-1/2, why would that skew results?
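
For concreteness, here's the scenario as a quick simulation (made-up traffic
numbers, identical true CTRs for both pages). With a constant underlying rate
the per-page CTRs come out the same either way, so presumably the danger
involves something else, like the rate changing over time?

    import random

    random.seed(0)

    def simulate(split_schedule, true_ctr={"old": 0.05, "new": 0.05}):
        # split_schedule: list of (n_visitors, share_shown_old_page) phases
        counts = {"old": [0, 0], "new": [0, 0]}  # page -> [clicks, views]
        for n_visitors, share_old in split_schedule:
            for _ in range(n_visitors):
                page = "old" if random.random() < share_old else "new"
                counts[page][1] += 1
                counts[page][0] += random.random() < true_ctr[page]
        return {page: clicks / views for page, (clicks, views) in counts.items()}

    # 2/3-1/3 for the first 30k visitors, then 1/2-1/2 for the next 30k
    print(simulate([(30000, 2 / 3), (30000, 1 / 2)]))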

~~~
Estragon
I'm curious about this too. The individual tests should be identifiable
regardless of the proportions in which they were assigned.

------
mattacurtis
To avoid big A/B testing mistakes, perhaps you shouldn't use phrases like "To
counter that noise it is important to first “prove the Null hypothesis.” To
prove the Null hypothesis you..."

One of the most important underlying statistical principles in inference tests
is that the null hypothesis can never be proven. Any data you collect can only
reject the null hypothesis or fail to reject it.
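
To make "reject or fail to reject" concrete, here's a minimal two-proportion
z-test sketch in Python (made-up counts, normal approximation):

    import math

    def two_proportion_z(clicks_a, n_a, clicks_b, n_b, alpha=0.05):
        p_a, p_b = clicks_a / n_a, clicks_b / n_b
        pooled = (clicks_a + clicks_b) / (n_a + n_b)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_a - p_b) / se
        p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
        # Either reject H0, or fail to reject it -- never "prove" it.
        return "reject H0" if p_value < alpha else "fail to reject H0"

    print(two_proportion_z(520, 10000, 480, 10000))  # made-up counts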

~~~
muddylemon
Good point. Not sure where I picked up that phrasing - especially since the
wikipedia article linked by the phrase specifically says:

> It is important to understand that the null hypothesis can never be proven.
> A set of data can only reject a null hypothesis or fail to reject it. For
> example, if comparison of two groups (e.g.: treatment, no treatment) reveals
> no statistically significant difference between the two, it does not mean
> that there is no difference in reality. It only means that there is not
> enough evidence to reject the null hypothesis (in other words, the
> experiment fails to reject the null hypothesis).

I'm going to update the post to fix that mistake. Thanks.

Edit: I updated that section. This part and the next part of the series are
the ones I'm most anxious about because of all the math and the questions that
have right and wrong answers.

~~~
mattacurtis
Here is better phrasing, and a good way to think about hypothesis testing:

"The goal of the test is to determine if the null hypothesis can be rejected.
A statistical test can either reject (prove false) or fail to reject (fail to
prove false) a null hypothesis, but never prove it true (i.e., failing to
reject a null hypothesis does not prove it true)." (Wikipedia)

I might also change some of the wording of these:

"The Null Hypothesis states that if you don’t change anything than nothing
will be different."

The null hypothesis states that there are no differences in X (where X is the
metric you are using to evaluate the performance of the different experiences,
or recipes; X could be things like Conversion Rate, Click Rate, Time on Page,
Revenue per Visitor, etc.) between the control experience and any number of
test experiences.

"If you divide your traffic and see significant differences in your metrics
you have failed the Null hypothesis."

I would just say "...you would reject the null hypothesis." In this case,
you're basically saying that there is evidence to suggest that the differences
in X (where X is the variable you're measuring) are not due to chance / noise
/ natural variation alone. By saying "you would reject the null hypothesis",
the implication is that you would accept the alternative hypothesis. Remember,
the null is that there are no differences in X and the alternative is that
there are differences in X (where X is your success metric).

"Failing the Null hypothesis means that either you have not collected enough
results to even out the noise or there is something wrong with the algorithm
that you are using to segment your traffic."

By "Failing the Null hypothesis," do you mean "rejecting it" or "failing to
reject it?"

If you mean "rejecting it", I'm not sure how to read what you wrote.

If you mean "failing to reject it," you're basically saying here that if you
don't reject the null hypothesis, it is due to one of 3 things (or a
combination of the 3):

1) Your sample size is not large enough

This is a tricky boat to get into. Any difference in the metric you're
testing for will reach statistical significance with a large enough sample
size. If you're interested in digging into this, look at the sample-size
formulas (there's a back-of-envelope sketch at the end of this comment).

You should decide ahead of time how long you want to run a test for, knowing
how much traffic the page you're testing will receive, and how many
experiences / groups / segments you are splitting this traffic into.

If you don't reach significance, it could just be that the differences you
were testing for are too small.

Knowing the standard deviation / variance in the metric you're testing can
help you understand which of the two it is (the sample size is too small, or
the differences you are testing for are too small).

2) You don't have enough evidence to suggest that any differences you see are
due to anything other than variation / chance / noise.

This can happen a lot when the changes / differences you're testing are too
small.

3) You have a Type II error

You have failed to reject a false null hypothesis. The test should have told
you there was a difference, but for whatever reason, the sample data you saw
did not provide enough evidence to reject the null.
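
To put a back-of-envelope number on the sample-size point in 1), here's the
standard two-proportion approximation (made-up rates; alpha = 0.05, power =
0.8):

    from statistics import NormalDist

    def sample_size_per_group(p_control, p_test, alpha=0.05, power=0.8):
        # Standard two-proportion approximation:
        # n = (z_{alpha/2} + z_power)^2 * (var_c + var_t) / (p_c - p_t)^2
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
        z_power = NormalDist().inv_cdf(power)
        variance = p_control * (1 - p_control) + p_test * (1 - p_test)
        return (z_alpha + z_power) ** 2 * variance / (p_control - p_test) ** 2

    # Detecting a lift from 5% to 6% needs roughly 20x the traffic
    # of detecting a lift from 5% to 10%:
    print(sample_size_per_group(0.05, 0.06))  # roughly 8200 per group
    print(sample_size_per_group(0.05, 0.10))  # roughly 430 per group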

~~~
mattacurtis
Oh - but I do also want to say that you did a great job on 99% of the article.
You hit on pretty much all the major considerations, biases and issues one
could run into when testing.

------
alchow8
Do you have any recommendations on a good stats book (easy to read/understand
but not as low level as a "dummies" book)?

~~~
btilly
All of the math you need to do straightforward A/B testing is in
<http://elem.com/~btilly/effective-ab-testing/>, along with a lot of other
useful information. (Don't try to comprehend the whole thing in one sitting.)

~~~
muddylemon
That's a very good resource. I have a copy that I've printed out and written
all over.

My next post in the series is going to be about "the math" and I'll likely
refer to your presentation quite a bit. Do you have an audio or video
recording of it being presented?

~~~
btilly
Sorry, I don't. I only presented it once, and that session was not recorded.

