
Puzzling outcomes in A/B testing
http://glinden.blogspot.co.uk/2012/07/puzzling-outcomes-in-ab-testing.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+GeekingWithGreg+(Geeking+with+Greg)&utm_content=Google+Reader
======
richardv
The attached PDF article is worth a read.

TL;DR... (of the PDF, not the article)

\---------------------

> Ensure that your statistical results are trustworthy. Incorrect results may
> cause bad ideas to be deployed or good ideas to be incorrectly ruled out.
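
Not from the PDF, just a minimal sketch of the kind of sanity check that point
implies: a plain two-proportion z-test on made-up conversion counts, so a
variant only gets called a winner when the observed difference is unlikely to
be noise. The function name and the numbers are hypothetical.

```python
# Two-proportion z-test on hypothetical conversion counts (illustration only).
from math import sqrt, erfc

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)         # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, erfc(abs(z) / sqrt(2))                 # two-sided normal tail

# Made-up counts: 10,000 users per arm, 500 vs 540 conversions.
z, p = two_proportion_ztest(500, 10_000, 540, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")                   # p > 0.05: don't ship it yet
```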

> You also need to make sure that your findings align with your business's
> long-term strategy, so that you don't make the mistake of sacrificing
> long-term growth for short-term financial gain. For example, Bing once had a
> bug that served poor search results, so distinct queries went up 10% and
> click-throughs on advertisements went up 30%. It would be a mistake to call
> this test successful.

> Just because you see an upwards trend in a newly launched feature does not
> mean that your users like the feature more. (Delayed effect/primacy effect)

> Don't run an experiment longer than you originally intended; it does not
> give you extra statistical power. (Pick a duration and stick to it.) Also do
> not stop your test early. (There is an algorithm that tells you when you have
> enough statistical confidence to stop your test, so that the benefit can be
> gained across the entire network, but I can't remember what it's called. It's
> commonly used in drug/pharmaceutical trials when test groups clearly have a
> fantastic response to a drug, and it would be cruel to restrict the drug to
> the placebo/control group.)
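
The warning about stopping early is easy to demonstrate. A hedged sketch in
Python (my own simulation, not code from the paper): both arms share the same
true conversion rate, so any declared winner is a false positive. Checking the
p-value every day and stopping at the first "significant" result produces far
more false positives than a single check at the planned end; all parameters
below are made up.

```python
import random
from math import sqrt, erfc

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a plain two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) or 1e-12
    z = (conv_b / n_b - conv_a / n_a) / se
    return erfc(abs(z) / sqrt(2))

random.seed(0)
RATE, DAILY_USERS, DAYS, ALPHA, SIMS = 0.05, 500, 20, 0.05, 500
peeking_wins = fixed_wins = 0
for _ in range(SIMS):                      # each simulation is one A/A experiment
    a = b = n_a = n_b = 0
    stopped_early = False
    for _ in range(DAYS):
        n_a += DAILY_USERS
        n_b += DAILY_USERS
        a += sum(random.random() < RATE for _ in range(DAILY_USERS))
        b += sum(random.random() < RATE for _ in range(DAILY_USERS))
        if not stopped_early and p_value(a, n_a, b, n_b) < ALPHA:
            peeking_wins += 1              # "winner" declared at the first peek
            stopped_early = True
    if p_value(a, n_a, b, n_b) < ALPHA:
        fixed_wins += 1                    # one check at the planned duration
print("false positive rate, peeking daily :", peeking_wins / SIMS)
print("false positive rate, fixed duration:", fixed_wins / SIMS)
```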

> Got interesting or surprising results? Re-run the experiment. Often there are
> underlying reasons that should be investigated.

\--------------------

> Don't make changes to your application if your average customer's lifetime
> value will decline, i.e. weigh the long term against the short term.

> Quickly identify the carryover effect... Google/Amazon/Yahoo etc. rely on a
> "bucket system" where users are split into buckets and an experiment is
> assigned to each bucket. A problem with this is the carryover effect, where
> the same users affected by one experiment receive the next experiment too
> (the effect can last weeks). This shows up as failing A/A tests. The
> experiment was re-run with a larger test group and with local randomization.
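
To make the "local randomization" fix concrete, here is a minimal sketch of
hash-based bucketing, my own illustration rather than anything from the paper:
salting the hash with the experiment name re-randomizes users for every
experiment, so a bucket shaped by one test isn't carried over wholesale into
the next one. The user id and experiment names are hypothetical.

```python
import hashlib

def bucket(user_id: str, experiment: str, n_buckets: int = 100) -> int:
    """Deterministically map a user to a bucket, independently per experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

def variant(user_id: str, experiment: str) -> str:
    # 50/50 split; the same user can land in different arms of different tests
    return "treatment" if bucket(user_id, experiment) < 50 else "control"

print(variant("user-42", "ranking-tweak-1"))
print(variant("user-42", "ranking-tweak-2"))  # independently re-randomized
```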

\--------------------

I just enjoyed reading it. I can't say that this will be useful for even 1% of
readers here. (I certainly can't apply any of the methods described, but the
underlying principles are useful to know).

------
bermanoid
I've said it before and I'll say it again - if your business exists primarily
to help users get a task completed and move on, then you absofrigginglutely
should not be running A/B tests the way content sites run them. If you can
achieve higher clicks-per-user or time-on-site metrics by making your product
worse at its core function, then you _need_ to have analysts that know what
they're doing in those situations, and they need to be able to win in fights
with the business types that don't understand this stuff and assume higher
traffic or CTRs are automatically a win. You're not playing the same split
testing game as everyone else, and you have to have people with both the
expertise and (equally important) the authority to handle this special
situation properly.

------
mwexler
There are many more papers and presentations on this topic at Ron Kohavi's MS
group's home page, <http://www.exp-platform.com/Pages/default.aspx>. Most of
these papers are worth a read, as they dissect how and where online
experimentation can really muck things up if not handled properly. Some are
more statistical than others, but almost all are relatively easy reads.

------
mute
More info: <http://icelandingpagedesign.com/#blog>

