
A/B Testing is Expensive - jamiequint
http://jamiequint.com/ab-testing-is-expensive
======
btilly
If you want both a strategy for testing with limited data and information
about the kinds of errors you are likely to encounter, you may want to read
[http://elem.com/~btilly/ab-testing-multiple-looks/part2-limited-data.html](http://elem.com/~btilly/ab-testing-multiple-looks/part2-limited-data.html).

~~~
jamiequint
Awesome! Thanks for sharing.

------
ivankirigin
Great post!

When you're first starting out, positioning matters. At
[http://yesgraph.com](http://yesgraph.com) we've found copy A/B tests to
produce incredible lift.

For example, people don't want to "invite" contacts. They do want to "email"
contacts, though. It's the same flow, but a few words triggered massive lift.
The lift is massive because the copy was so unoptimized to begin with, so it
is specifically at the start that such small tests matter most.

~~~
jamiequint
Absolutely, small tests can make large differences.

However, you only have so many "bullets" to shoot at tests with a small
audience, so you have to be very picky. Sounds like contact inviting is a key
viral feature of YesGraph, so it makes a lot of sense to optimize there.

I'm not anti A/B testing at all, but anecdotally a lot of people I talk to
about this stuff fire their test bullets on the wrong things and end up with
not much to show for it.

A better understanding of psychology goes a long way toward intuiting where
you may actually be able to see gains, and how to go about achieving them.
Sometimes small changes produce large gains, although I have seen large
changes produce larger gains more often.

~~~
ivankirigin
Agreed

------
swalling
One cheap (in time and money) complement to A/B or multivariate testing that
the author doesn't mention is usability testing, specifically remote testing.
Before we launch a test, we always run remote usability tests.

Feedback like this should be taken with a grain of salt, since these people
are testers, not necessarily like your users in all respects. But it's still
really valuable. I've caught numerous errors that test data would not help me
understand easily.

Combine remote usability testing through something like usertesting.com with
prototyping, and you've got a really rapid way to get feedback on the cheap,
even if you don't have enough site visitors to reach statistical significance
in a reasonable time frame.

------
insickness
If you want to test a sales page prior to product launch, it will be expensive
to A/B test because you can't use natural visitors.

I have a site with a few thousand visitors per day, and I had a product I was
going to release. I was working up to a big product launch, building
anticipation via my email list and on the site itself. In that case, I
couldn't use natural traffic from the site to test the sales page prior to
launch. I had to use cold traffic such as AdWords to see which version of the
page people responded to. I probably spent about $8k on traffic just crafting
and A/B testing the sales page.

But it was worth it. It was like a university education in marketing.
Marketing can be so counter-intuitive. So many things I expected to work did
not work and vice-versa. But once I had the final tested sales page in place
it worked and it worked well. I still get about a 4% conversion rate. And most
importantly, I knew that when I did finally launch, I had a well-tested and
solid page that would convert the large initial influx of customers from the
build-up before the product launch.

------
ryanglasgow
Interesting read, but I would have to disagree. It's not difficult to reach
90% confidence with a very small sample size:

    - Variation A and B each receive 20 visits
    - Variation A receives 10 clicks while Variation B receives 5 clicks
    - The confidence level for Variation A is 90%
    (Source: https://mixpanel.com/labs/split-test-calculator)

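For what it's worth, here's a rough sketch of how you could check numbers like
those yourself with a pooled two-proportion z-test (a Python sketch assuming
scipy; the visit and click counts are just the hypothetical ones above):

    from math import sqrt
    from scipy.stats import norm

    # Hypothetical counts from the example above.
    clicks_a, visits_a = 10, 20
    clicks_b, visits_b = 5, 20

    p_a, p_b = clicks_a / visits_a, clicks_b / visits_b

    # Pooled two-proportion z-test.
    p_pool = (clicks_a + clicks_b) / (visits_a + visits_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / visits_a + 1 / visits_b))
    z = (p_a - p_b) / se
    p_two_sided = 2 * norm.sf(abs(z))

    print(f"z = {z:.2f}, two-sided p = {p_two_sided:.3f}")
    # z = 1.63, two-sided p = 0.102 -- right around 90% confidence,
    # consistent with what the calculator reports
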
Also, I wrote an article titled "Creating Successful Product Flows" that is
very relevant to this post:
[https://medium.com/design-startups/c41ffbce49a1](https://medium.com/design-startups/c41ffbce49a1)

~~~
graeme
Can someone with expertise comment on this? I once worked at a company where
the founders thought that small samples were adequate. I thought that the
calculators were misleading with such small sample sizes, even though they
gave "high confidence".

But that was only based on my intuition, not math, and I've never seen anyone
give a good discussion of whether "90% confidence" is as definitive as it
sounds in the context of a very small sample.

~~~
ronaldx
It's a bit awkward to give a full answer to this, but this is to the best of
my understanding and explained as simply as is reasonable:

A small sample has less statistical 'power' to identify significant
differences where they exist. Put another way, a large sample is more likely
to give a true significant result than a small sample.

But, if you do see 10% significance(/90% confidence) in a small sample, this
is just as good as 10% significance in a large sample. Although the cutoff
point will be rougher in a smaller sample, it's good standard practice to
round conservatively to account for this.

10% is unlikely to be considered a good result for statistics in either case -
you can engineer a result by doing 10 tests on nothing (at the 10% level, ten
tests of a null effect give roughly a 65% chance of at least one false
positive), and there's a danger you would have unknowingly or unconsciously
done this, maybe (for example) by not deciding the sample size in advance.
However, there's also presumably strong enough evidence against a harmful
difference that you aren't likely to lose anything by following these results.

It can be a good idea to do numerous small investigative tests as
justification for bigger tests - relying on lots of small tests alone
requires accounting for multiple testing (e.g. a Bonferroni correction, as
sketched below).
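
Concretely, a minimal sketch of that correction (Python; the p-values are
made up for illustration):

    # Bonferroni: with m tests at family-wise level alpha, hold each
    # individual test to alpha / m instead.
    alpha = 0.10                          # 90% confidence overall
    p_values = [0.03, 0.08, 0.20, 0.04]   # hypothetical p-values from 4 small tests

    threshold = alpha / len(p_values)     # 0.025 here
    survivors = [p for p in p_values if p < threshold]
    print(f"per-test threshold = {threshold:.3f}, survivors: {survivors}")
    # survivors: [] -- tests that looked fine at the 10% level individually
    # no longer clear the corrected bar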

~~~
vasilipupkin
"But, if you do see 10% significance(/90% confidence) in a small sample, this
is just as good as 10% significance in a large sample". That is not true,
strictly speaking. You are assuming that small sample describes the underlying
distribution well. But this may not be the case due to non-normality of the
distribution itself or potential biases

~~~
ronaldx
Cool point and I agree.

The sample has to represent the population; that's fundamental. If the sample
is so small that it can't characterise the population distribution, then you
have a problem anyway. If you're measuring events that happen 1% of the time
(or 99% of the time), a sample of 100 is not nearly enough.
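
To put a number on that last point (a Python sketch, assuming scipy):

    from scipy.stats import binom

    # With a 1%-rate event, a sample of 100 has a ~37% chance of containing
    # zero events at all -- far too little to characterise the distribution.
    print(binom.pmf(0, 100, 0.01))   # ~ 0.366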

If you chose an appropriate non-parametric test to cover an unknown
distribution with a small sample, it might well have zero power (it would be
impossible for it to give a significant result).

------
jtcchan
I agree with your points re: A/B testing early on for startups, but I don't
think the conclusion is that A/B testing is an expensive option; it's more
that it's the wrong option.

Yes, you'll need moderate levels of traffic for split tests to be effective,
so if you don't have the traffic or time to wait around, you should be talking
to your users.

------
grinnick
I recently wrote a calculator which will tell you how many days it will take
to run an A/B test to 90% confidence.

You just plug in

1. the number of pageviews your page got in the last month

2. the number of conversions that resulted from those pageviews

[http://abtestcalculator.com](http://abtestcalculator.com)
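
For anyone curious, here's a rough sketch of the kind of arithmetic a
calculator like this might run (a Python sketch; the traffic numbers, the 10%
minimum detectable lift, and the 80% power target are my own assumptions, not
necessarily what abtestcalculator.com uses):

    from math import ceil, sqrt
    from scipy.stats import norm

    # Made-up inputs mirroring the calculator's: last month's traffic and
    # conversions, plus the smallest lift worth detecting.
    monthly_pageviews = 30000
    monthly_conversions = 900          # 3% baseline conversion rate
    mde = 0.10                         # assumed minimum detectable lift of 10%

    p1 = monthly_conversions / monthly_pageviews
    p2 = p1 * (1 + mde)
    daily_views = monthly_pageviews / 30

    # Standard per-variant sample size for a two-sided two-proportion test
    # at 90% confidence and 80% power.
    z_alpha = norm.ppf(1 - 0.10 / 2)
    z_beta = norm.ppf(0.80)
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p2 - p1) ** 2

    days = ceil(2 * n / daily_views)   # the two variants split the traffic
    print(f"~{ceil(n)} visitors per variant, ~{days} days at current traffic")
    # with these numbers: ~42,000 per variant, ~84 days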

------
peeplaja
Here's an article on how to do conversion optimization with little to no
traffic: [http://conversionxl.com/how-to-do-conversion-optimization-with-very-little-traffic/](http://conversionxl.com/how-to-do-conversion-optimization-with-very-little-traffic/)

------
tersiag
An alternative could be cloud based eye tracking testing services such as
[http://www.gazehub.com/](http://www.gazehub.com/) where you can get a lot
info about how visitos navigate... and you dont need large volumes of traffic

------
badman_ting
In a lot of cases, I would say it's ridiculously cheap -- you can make some
very small changes and get a huge response. But in the context of a site that
isn't getting many visitors yet, this makes sense.

------
codexity
As a practical matter, with limited data, you can't do A/B testing. You have
to just move forward, guessing at each step -- yet carefully tracking whether
each change brought an improvement.

