
A/B testing mistakes - ankneo
http://visualwebsiteoptimizer.com/split-testing-blog/seven-ab-testing-mistakes-to-stop-in-2013/
======
Jasber
I recently implemented A/B testing on a client's site using one of these
JavaScript-based A/B testing tools (but not this one).

I hadn't used one before, so wanted to verify the data would actually be
accurate.

I did an A/A test, basically testing the exact same page against itself,
expecting the results to be the same.

Not only were the results not the same, but they were off by a wide margin.

Given this, I don't know how I'm supposed to trust any of the data.

Has anyone else had experiences like this? Is A/B testing in Javascript just
not as reliable?

~~~
siddharthdeswal
There's a very simple reason for that.

You're seeing a difference between Control and Variation in an A/A test
because only a very small number of visitors have been tested so far. To
explain, suppose you toss a coin 10 times and it shows heads on 7 of them.
Based on just those 10 tosses, would you conclude that the coin is loaded?
Probably not. Suppose you tossed the coin 100 times; it'll probably show
heads maybe 43 or 47 or 51 or 52 times.

Point being, as you toss it more and more, the proportion of heads comes
closer and closer to 50%, but you need to toss it a large number of times to
be fairly certain that it isn't loaded. The more you toss it, the more
certain you are. However, you'll only ever be more and more certain, never
completely certain. VWO works on a similar principle: the more times you
show Control and Variation to visitors, the more certain you become that one
is better than, worse than, or equal to the other.

If you read the post, the graph shows the fluctuations at the beginning,
after which things settle down. In an A/A test, they'll settle down to very
similar conversion rates.
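To make that concrete, here's a minimal Python sketch (not VWO's code; the
5% conversion rate and visitor counts are made up for illustration) that
simulates an A/A test at increasing sample sizes:

```python
import random

random.seed(42)
TRUE_RATE = 0.05  # both arms convert at the same (hypothetical) rate

def observed_rate(n_visitors):
    """Simulate n_visitors and return the fraction that convert."""
    return sum(random.random() < TRUE_RATE for _ in range(n_visitors)) / n_visitors

for n in (100, 1_000, 10_000, 100_000):
    a, b = observed_rate(n), observed_rate(n)
    print(f"{n:>7} visitors/arm: A={a:.3%}  B={b:.3%}  gap={abs(a - b):.3%}")
```

The gap between the two identical arms is wide at small sample sizes and
shrinks as the visitor count grows, which is exactly the settling-down the
graph in the post shows.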

Here's an article from the VWO Knowledgebase that'll help you run an A/B
test correctly:
[http://visualwebsiteoptimizer.com/knowledge/how-to-ideally-run-an-ab-split-test/](http://visualwebsiteoptimizer.com/knowledge/how-to-ideally-run-an-ab-split-test/)

~~~
chc
Some people actually suggest running A/A/B tests just to gauge how much noise
is in their numbers, though that requires even more visitors to achieve
statistical confidence since they're spread out among more options.

~~~
jfarmer
I've worked at companies that tried to do this before. It makes no sense and
shows the people running the A/B tests don't really understand the statistics
behind A/B testing.

If I'm running an A/A test at 95% confidence with a sufficient number of
visitors for whatever effect size I'm interested in, then 1 in 20 A/A tests
will register a false positive. That's what "95% confidence" means. It does
not mean there is "too much noise."

Moreover, in a proper A/B test, the A group and B group need to be independent
and identically distributed. So, in an A/A/B test, if the A/A disagree it
_shouldn't tell you anything about B_. That's what "independent" means.

If you want to be more confident you just decrease your alpha (i.e., raise
your confidence level). alpha=0.05 is already too high for most consumer web
apps anyhow, IMO, but go wild. 99% confidence! Woo!

As a rule you want higher confidence when the cost of a mistake is high, e.g.,
this medicine gives people brain tumors! Oops.
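To see the 1-in-20 figure fall out by construction, here's a rough Python
sketch of my own (using a two-proportion z-test as the assumed significance
test; the rate and sample sizes are invented): about 5% of A/A tests come
back "significant" at alpha=0.05.

```python
import random
from statistics import NormalDist

random.seed(0)
ALPHA, RATE, N = 0.05, 0.05, 2_000  # significance level, true rate, visitors/arm

def aa_test_is_significant():
    """One A/A test: identical arms compared with a two-proportion z-test."""
    conv_a = sum(random.random() < RATE for _ in range(N))
    conv_b = sum(random.random() < RATE for _ in range(N))
    pooled = (conv_a + conv_b) / (2 * N)
    se = (2 * pooled * (1 - pooled) / N) ** 0.5
    z = abs(conv_a - conv_b) / N / se
    return 2 * (1 - NormalDist().cdf(z)) < ALPHA  # two-sided p-value

trials = 1_000
positives = sum(aa_test_is_significant() for _ in range(trials))
print(f"{positives}/{trials} A/A tests 'significant' -- expect roughly {ALPHA:.0%}")
```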

~~~
mjw
Perhaps you could view this "A/A/B" test as a very crude form of the
bootstrapping method
(<http://en.wikipedia.org/wiki/Bootstrapping_(statistics)>)? At least if
you're resampling A1 and A2 from a pool A and then doing separate A1/B and
A2/B tests and looking at how much the resulting statistic varies between
the two runs.

Agreed this is a silly way to go about it, but there are better-thought-out
bootstrapped confidence tests which could be used if you don't fully trust
the distributional assumptions behind (say) the t-test.
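For what it's worth, a basic percentile-bootstrap comparison of two arms
might look like this Python sketch (the conversion counts and resample count
are invented; this is the generic technique, not any particular tool's
implementation):

```python
import random

random.seed(1)

# Hypothetical per-visitor outcomes: 1 = converted, 0 = did not.
a = [1] * 55 + [0] * 945   # arm A: 55/1000 conversions
b = [1] * 72 + [0] * 928   # arm B: 72/1000 conversions

def rate(xs):
    return sum(xs) / len(xs)

# Bootstrap the difference in conversion rates: resample each arm
# with replacement and recompute the statistic many times.
diffs = []
for _ in range(5_000):
    ra = rate(random.choices(a, k=len(a)))
    rb = rate(random.choices(b, k=len(b)))
    diffs.append(rb - ra)

diffs.sort()
lo, hi = diffs[125], diffs[4_875]  # empirical 95% interval (2.5th/97.5th pct)
print(f"observed diff: {rate(b) - rate(a):+.3%}")
print(f"bootstrap 95% CI: [{lo:+.3%}, {hi:+.3%}]")
```

With these made-up numbers the interval straddles zero, so you'd fail to
reject the null, and without leaning on the t-test's normality assumptions.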

~~~
jfarmer
I wish! The words "empirical distribution" are music to my ears.

No, usually the rule people use is this: "If A1 and A2 show a statistically
significant difference, then do not reject the null hypothesis regardless of
A1/B or A2/B."

------
karolisd
8) Have a hypothesis of what you're testing and control for variables. Run
an MVT (multivariate) test if you're changing a lot of things. If the test
wins and gets implemented, everyone is happy and people don't ask too many
questions. If it loses, what have you learned? Test a hypothesis.

If a client looks at a comp for a test and asks to change something, I always
ask them, "What hypothesis are we testing with that change?"

