
A/B Testing Scale Cheat Sheet - btilly
http://bentilly.blogspot.com/2012/10/ab-testing-scale-cheat-sheet.html
======
tisme
Isn't cutting off at some confidence level on the 'do not do this' list for
A/B testing? My understanding so far is that you test until you reach a
predefined number of conversions, and then check your significance to see if
the result you obtained is valid - not that you test until you 'hit
confidence'.

See for instance: <http://www.evanmiller.org/how-not-to-run-an-ab-test.html>

Is there anybody here that can shed some light on this?

~~~
btilly
Yeah.

That's a blog post that I should get around to writing a rebuttal to some day.
Because it is widely quoted and off base.

In theoretical frequentist math world, it is correct. If you peek repeatedly,
eventually you'll come to confidence when there is no difference. Back in the
real world, it is perfectly acceptable to use a strategy like, _"We'll set a
really high confidence (eg 99.5%) for cut-off until we get to a couple of
thousand successes, and then we'll drop our standards substantially (eg 95%
cut-off). If we are forced to stop for business reasons, we'll choose whatever
happens to be ahead at the moment."_
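
A minimal sketch of what that looks like in practice, assuming a pooled
two-proportion z-test as the confidence measure (the thresholds and the
switch-over point come from the strategy above; the particular test, the
function names, and the defaults are just my illustration - a G-test would do
equally well):

```python
# Sketch of the stopping rule described above. The 99.5%/95% thresholds and
# the "couple of thousand successes" switch-over come from the quoted rule;
# the pooled z-test is just one convenient way to get a confidence number.
from math import erf, sqrt

def confidence(conv_a, n_a, conv_b, n_b):
    """Two-sided confidence that A and B differ, via a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)            # pooled conversion rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))   # standard error of the difference
    if se == 0:
        return 0.0
    z = abs(p_a - p_b) / se
    return erf(z / sqrt(2))                        # P(|Z| < z), e.g. z=1.96 -> ~0.95

def should_stop(conv_a, n_a, conv_b, n_b,
                early_cutoff=0.995, late_cutoff=0.95, enough_successes=2000):
    """Demand 99.5% confidence until a couple of thousand total conversions,
    then 95%. (The forced-stop clause of the rule - ship whichever arm is
    ahead - isn't modelled here.)"""
    cutoff = early_cutoff if conv_a + conv_b < enough_successes else late_cutoff
    return confidence(conv_a, n_a, conv_b, n_b) >= cutoff
```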

And yes, I can use Bayesian statistics to demonstrate that following the
strategy that I describe creates acceptably low probabilities of making
somewhat wrong business decisions, while allowing you to make good business
decisions more quickly. And in practice people can follow it without needing a
strong statistical background. (If I did enough work I could come up with a
sophisticated optimal curve to use in making decisions. But I have not done
that work, and in practice explaining it would be more work than it is worth.)

Why is this? Two reasons.

The first is that you only really get "independent peeks" at different orders
of magnitude of data. Thus if you wait until you're past a small amount of
data before you start peeking, you don't get a strong "repeated looks" effect.

Secondly, coming to the wrong decision only matters to a business when the
chosen option is substantially worse. If you follow a rule like the one I
gave, your odds of accidentally making up your mind in the wrong way when
there is a business-significant difference are surprisingly low. For instance,
if you would detect a 2% difference as significant, and there is a real
underlying difference of 1%, the odds that a decision made right now at a 95%
confidence level is correct are 99.2%. And if the real difference is a 0.5%
win, your odds of making the right decision right now are 91.5%. (This despite
the fact that you'd expect to need 16x as much data in order to even have a
good chance of _detecting_ a 0.5% win!)
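
For what it's worth, you can check the flavour of those numbers with a quick
simulation. The sketch below does not reproduce the exact 99.2% / 91.5%
figures, since the precise setup isn't spelled out here, but it shows the same
effect: with a true lift far smaller than what the traffic can reliably
detect, tests that stop the moment they reach 95% confidence still pick the
better arm far more often than not. The baseline rate, peek schedule, and
traffic cap are illustrative assumptions.

```python
# Monte Carlo sketch: peek every few thousand visitors per arm, stop the
# moment the test hits 95% confidence, and record whether the declared winner
# is the arm that is genuinely better. All specific numbers are made up.
import numpy as np
from math import erf, sqrt

def confidence(conv_a, n_a, conv_b, n_b):
    """Same pooled two-proportion z-test as in the sketch further up."""
    p = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    return erf(abs(conv_a / n_a - conv_b / n_b) / se / sqrt(2))

def one_test(rng, p_a, p_b, peek_every=5_000, max_per_arm=200_000, cutoff=0.95):
    """Return 'A' or 'B' if the test stops on confidence, None if traffic runs out."""
    conv_a = conv_b = n_a = n_b = 0
    while n_a < max_per_arm:
        conv_a += rng.binomial(peek_every, p_a); n_a += peek_every
        conv_b += rng.binomial(peek_every, p_b); n_b += peek_every
        if confidence(conv_a, n_a, conv_b, n_b) >= cutoff:
            return "B" if conv_b / n_b > conv_a / n_a else "A"
    return None

rng = np.random.default_rng(0)
# True win for B of 1% relative (5.00% -> 5.05%), far below what this much
# traffic can reliably *detect*.
results = [one_test(rng, 0.0500, 0.0505) for _ in range(1_000)]
decided = [r for r in results if r is not None]
print(f"reached a decision: {len(decided)}/{len(results)}; "
      f"picked the true winner: {decided.count('B') / max(len(decided), 1):.1%}")
```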

Thus the decisions that you're making are usually correct. And on the
occasions where you make the wrong choice, the option you chose is usually not
materially worse.

------
paraschopra
Great list Ben! In my experience of analyzing and commenting on A/B test
results of our (Visual Website Optimizer) customers, one of the most important
effects I have observed is the newness effect of variations. Visitors are
sometimes inclined to respond positively to a new variation just because it's
a change from plain, old boring control. The newness effect fades after a
couple of days, but it makes our customers prone to celebrating the success of
their test early. That is why we always recommend deciding on a preset number
of days to run the test before even looking at the results (we have a setting
for that). And as a general rule of thumb, we ask our customers to wait at
least 7 days before concluding anything from the results.

~~~
btilly
The newness effect can be a really confusing PITA. Another strategy for
handling it in long-running tests is cohort analysis - analyze what people's
actions were starting x days after they entered the test.
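
A minimal sketch of that cohort cut, assuming a table with one row per visitor
and hypothetical column names:

```python
# Only count conversions that happened at least `skip_days` after the visitor
# entered the test, and only for visitors who have been in the test at least
# that long. Column names ("variation", "entered_at", "converted_at") are
# hypothetical.
import pandas as pd

def post_newness_conversion(df: pd.DataFrame, skip_days: int = 7) -> pd.Series:
    """Conversion rate per variation, ignoring the first `skip_days` after entry."""
    window = pd.Timedelta(days=skip_days)
    eligible = df[df["entered_at"] <= df["entered_at"].max() - window]
    converted = (
        eligible["converted_at"].notna()
        & (eligible["converted_at"] >= eligible["entered_at"] + window)
    )
    return converted.groupby(eligible["variation"]).mean()
```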

That said, if you turn it around, deliberately finding ways to use the newness
effect to your advantage can be very rewarding.

~~~
paraschopra
> That said, if you turn it around, deliberately finding ways to use the
newness effect to your advantage can be very rewarding.

I'm curious to hear more about this. How do you make use of the newness
effect? Have you practiced this in the wild?

~~~
btilly
As I suggested in
<http://bentilly.blogspot.com/2012/09/ab-testing-vs-mab-algorithms-its.html>
you can use a MAB algorithm with a forgetting factor for this. I've
actually done it with a more sophisticated system, but the idea is similar -
boost whatever you have evidence is performing better right now. (You should
put in logic to avoid time-based fluctuations in performance throwing you
off.)

The one that I built had minor teething pains, but it is now working quite
well. And it is convenient to be able to drop in new versions every so often,
remove ones which have proven unable to perform up to snuff even with a
newness effect, and just trust that it will Do The Right Thing.
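
For the curious, here is a toy version of the forgetting-factor idea. The
Beta/Thompson-sampling machinery and the decay constant are my own
illustration rather than the system described above, and it leaves out the
time-of-day logic mentioned in the parenthetical:

```python
# Thompson sampling where each arm's evidence decays toward its prior, so
# recent performance counts for more than old performance. A variation that is
# winning right now (newness effect included) gets most of the traffic; once
# the effect fades, the decayed counts let the system change its mind.
import random

class ForgetfulBandit:
    def __init__(self, arms, decay=0.999):
        self.decay = decay
        self.stats = {arm: [1.0, 1.0] for arm in arms}  # ~Beta(successes+1, failures+1)

    def choose(self):
        """Sample a plausible conversion rate per arm and play the best draw."""
        return max(self.stats, key=lambda arm: random.betavariate(*self.stats[arm]))

    def update(self, arm, converted):
        # Decay every arm's evidence toward the Beta(1, 1) prior, then credit this play.
        for s in self.stats.values():
            s[0] = 1.0 + self.decay * (s[0] - 1.0)
            s[1] = 1.0 + self.decay * (s[1] - 1.0)
        self.stats[arm][0 if converted else 1] += 1

# Per visitor: arm = bandit.choose(); show that variation; later call
# bandit.update(arm, converted) with whether they converted.
```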

------
hornbaker
Good article. This G-test calculator, also by Ben, is my favorite tool for
picking A/B test winners:
<http://elem.com/~btilly/effective-ab-testing/g-test-calculator.html>
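
For anyone who wants the same kind of number without leaving a script: the
G-test is a log-likelihood-ratio test on the 2x2 table of conversions vs.
non-conversions, and SciPy exposes it through its power-divergence family. The
counts below are made up, and Yates' correction is switched off to match a
plain G-test.

```python
# G-test (log-likelihood ratio) on a 2x2 conversion table, one degree of
# freedom, compared against a chi-squared distribution. Counts are made up.
from scipy.stats import chi2_contingency

table = [[130, 9870],   # A: converted, did not convert
         [170, 9830]]   # B: converted, did not convert

g, p_value, dof, expected = chi2_contingency(
    table, lambda_="log-likelihood", correction=False)
print(f"G = {g:.2f}, p = {p_value:.4f}, confidence = {1 - p_value:.1%}")
```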

------
rscale
Two additional points:

1) Be careful about what you're actually testing. More clicks or conversions
are quite likely good, but over the long term you want to track LTV. An effort
that brings in a lot of low-LTV customers will have great metrics at launch
but will be disappointing down the road.

2) Consider segmenting your traffic on appropriate dimensions (campaign,
referrer, previous behavior, platform, day of week, time of day), because it's
common for a change to be amazingly effective (or ineffective), but only for
one segment. Those opportunities can be lost if you treat your customer base
as a single segment (see the sketch below).
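
A lightweight way to do that breakdown, assuming a per-visitor results table
with hypothetical column names:

```python
# Break the test out by one segmentation dimension and report conversion rate
# and visitor counts per (segment, variation), so an aggregate "winner" can be
# sanity-checked segment by segment. Column names are hypothetical.
import pandas as pd

def lift_by_segment(results: pd.DataFrame, segment_col: str) -> pd.DataFrame:
    grouped = results.groupby([segment_col, "variation"])["converted"]
    return grouped.agg(rate="mean", visitors="size").unstack("variation")

# e.g. lift_by_segment(results, "referrer") or lift_by_segment(results, "platform").
# Small per-segment samples mean weaker conclusions, as the reply below notes.
```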

~~~
btilly
Good points, but those would be improvements for people who have more volume.

As I indicated, tests of revenue (including LTV) are significantly harder to
do. Also segmenting data makes it more work to interpret, and means you need
to collect more data to draw conclusions.

If you have the data, by all means do it. If not, then try not to be too
unhappy that you can't do it yet.

~~~
rscale
I'd argue that if you have enough traffic to look for 1-2% improvements, you
have enough traffic to do basic segmentation and to start measuring value.

That said, I agree with your greater point that you always need to make sure
your tactics are aligned with your situation and strategy. For instance, if
you're still seeking product/market fit, don't worry about micro-
optimizations; just follow best practices for the big items and optimize them
later.

