

Appsumo reveals its A/B testing secret: only 1 out of 8 tests produce results - paraschopra
http://visualwebsiteoptimizer.com/split-testing-blog/a-b-testing-tips/

======
patio11
This is roughly my experience, too -- take a look at how many null results
("no significant difference") I rack up:
<http://www.bingocardcreator.com/abingo/results>

To add insult to injury, half of the remainder were significant... in the
wrong direction.

(Edit to add: null results aren't failures, though. What is the Edison quote:
you now know one more thing that didn't work.)

~~~
tomjen3
I wouldn't call them failed tests. In science a null result means that you
didn't learn anything, but in this case who cares if p is 0.6? It is still
more likely to make you money.

~~~
bdonlan
The thing about non-statistically-significant tests is you can't, in fact, say
"it is still more likely to make you money." You can say that the B set did in
fact make you more money this time, but you can't rule out that being luck of
the draw. It might even have a (very small) bias AGAINST making you money, but
that bias got lost in the noise. That's why it's important to discard non-
statistically-significant tests (or expand the sample size if you _really_
want to know...)
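
A minimal sketch of bdonlan's point, assuming scipy is available and using
entirely made-up counts: an observed lift can sit comfortably inside the
noise.

    from scipy.stats import chi2_contingency

    # Hypothetical data: B "won" with 55 conversions vs 50, out of 1000
    # visitors each -- a 10% relative lift.
    conversions = [50, 55]
    visitors = [1000, 1000]
    table = [[c, v - c] for c, v in zip(conversions, visitors)]
    chi2, p, dof, expected = chi2_contingency(table)
    print(p)  # ~0.69: B's lift is indistinguishable from luck of the draw

A p-value that high means the data are entirely consistent with no difference
at all, which is exactly why you can't bank on B "making you more money."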

------
ig1
These aren't failures. If the A/B test shows no impact, or shows that the
change is a negative, both results add value by telling you what to
concentrate on.

Before people did A/B testing, it was generally assumed for Facebook ads that
the title was the second most important factor in CTR, so people spent a lot
of time tuning title copy.

After lots of people did A/B testing, it turned out that title copy has
almost no impact: people switched their titles to Chinese (for a
non-Chinese-speaking audience) without seeing any change. That "no
significant difference" means thousands of man-hours can be spent elsewhere
rather than on copywriting ad titles.

~~~
paraschopra
I'd be wary of concluding too much from other people's A/B tests. Yes, they
serve as a good starting point, but doing A/B tests yourself is still
imperative. Case in point: Noah found that headlines didn't matter to his
audience, but for many of our customers a small change in headline does
produce significant results.

~~~
ig1
Sure, I wasn't saying that the headline results were generalizable, but "no
significant difference" A/B testing results can tell you to stop wasting time
in an area.

------
noelwelsh
A lot of these "no results" are just bad statistics. When a test shows no
result it does not mean there is no significant difference; it means the test
cannot reliably tell given the available data. Technically, this is because
as you decrease the probability of a type I error (lower the p-value) you
increase the probability of a type II error. To lower the probability of a
type II error, and hence have null results actually mean there is no
significant difference, you need lots of hits. Under assumptions that are
reasonable for most e-commerce sites (e.g. a 5% conversion rate, wanting to
detect a 5% relative change) you need about 100'000 hits. Very few people
have this much data, and hence their inferences are flawed. You can do a lot
better than A/B testing with better maths. More here:

<http://untyped.com/untyping/2011/02/11/stop-ab-testing-and-make-out-like-a-bandit/>
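
The post's title points at bandit algorithms as the "better maths"; below is
a minimal epsilon-greedy sketch, purely illustrative and not necessarily the
strategy the post actually recommends. Unlike a fixed 50/50 split, it routes
most traffic to the apparent winner while still exploring.

    import random

    EPSILON = 0.1                    # fraction of traffic spent exploring
    trials = {"A": 0, "B": 0}        # visitors shown each variant
    conversions = {"A": 0, "B": 0}   # successes observed per variant

    def choose_variant():
        # Explore at random with probability EPSILON (and before any data);
        # otherwise exploit the variant with the best observed rate so far.
        if random.random() < EPSILON or not any(trials.values()):
            return random.choice(list(trials))
        return max(trials, key=lambda v: conversions[v] / max(trials[v], 1))

    def record(variant, converted):
        trials[variant] += 1
        conversions[variant] += int(converted)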

~~~
paraschopra
I assume people who have run these "no results" A/B tests only gave up after
enough traffic to establish that no significant result was found, given a
pre-determined power X and statistical significance cutoff Y. We have a
calculator here to help determine the maximum number of visitors you need
before giving up on a test:
<http://visualwebsiteoptimizer.com/ab-split-test-duration/>

~~~
noelwelsh
That's a pretty big assumption to make. I've seen people doing A/B testing w/
a few thousand hits max, which is not enough for most scenarios.

~~~
paraschopra
What's the big assumption? We are using math here, not heuristics. There is
no rule of thumb saying you need hundreds or thousands of visitors to get
significant results.

~~~
noelwelsh
I was questioning this assumption of yours: "I assume people who have run
these "no results" A/B tests only gave up after enough traffic to establish no
significant result was found given pre-determined X power and Y statistical
significance cutoff".

I am familiar with the maths for A/B testing. The figure of approximately
100'000 hits comes from "Controlled experiments on the web: survey and
practical guide". As I stated, this assumes 5% conversion rate and a few other
things. Here's the quote:

"If, however, you were only looking for 5% change in conversion rate (not
revenue), a lower variability OEC [Overall Evaluation Criteria] based on point
3.b can be used. Purchase, a conversion event, is modeled as a Bernoulli trial
with p = 0.05 being the probability of a purchase. The standard deviation of a
Bernoulli is √p(1− p) and thus you will need less than 122,000 users to
achieve the desired power based on 16 ∗ (0.05 · (1−0.05))/(0.05 · 0.05)^2."

(The actual value is 121'600.)

~~~
ltjohnson
122,000 seemed absurdly large to me, so I looked up the paper you referenced
[1]. It looks like there is a calculation error in the paper.

The formula they use is n = 16 σ^2 / Δ^2, where σ is the standard deviation
and Δ is the size of the difference. Thus in this problem Δ = 0.05, and their
formula gives n = 16 · (0.05 · (1−0.05)) / 0.05^2 = 304. This is much more in
line with what you get from a two-sample proportion test (with H_a: p_1 ≠
p_2), ~440 in each group [2].

But maybe I misunderstand their formula.

[1] <http://exp-platform.com/hippo_long.aspx>

[2] <http://statpages.org/proppowr.html>

edit: fixed Greek letters and added final comment.

~~~
gjm11
I think you misunderstand what they're doing with the formula. Δ is the size
of the difference you want to detect. In this case, they're saying they want
to detect a 5% (relative) change in a 5% (absolute) conversion rate. So
σ^2 = 0.05 · (1−0.05), but Δ is 5% of 5%, i.e. 0.0025, and the denominator
needs to be 0.0025^2.

If you go to your reference [2] and enter the numbers 0.05, 80, 0.05, 0.0525,
1.0, you'll see that they come up with a sample size of about 122k in each
group (so 244k in both together).

(The figure of 304 or 440 is what you would get if you wanted to detect an
_absolute_ change of 5% in the conversion rate: going from 5% to 0% or to
10%.)
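
A quick check of both readings, sketched with statsmodels (assuming it's
available); the per-group figure agrees with the ~122k from reference [2].

    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    # Relative reading: detect 5.00% -> 5.25% at alpha = 0.05, power = 0.8.
    h = proportion_effectsize(0.0525, 0.05)
    n = NormalIndPower().solve_power(effect_size=h, alpha=0.05, power=0.8)
    print(n)  # ~122,000 per group

    # The paper's rule of thumb, n = 16 * sigma^2 / delta^2:
    print(16 * 0.05 * (1 - 0.05) / 0.0025 ** 2)  # 121,600 (relative reading)
    print(16 * 0.05 * (1 - 0.05) / 0.05 ** 2)    # 304 (absolute, 5% -> 10%)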

~~~
ltjohnson
You're right; reading the paper that way, they are trying to detect a change
in conversion rate from 5% to 5.25%. I was confused by the use of % in two
separate contexts in the same sentence. That being said, I think this is not
a good argument against A/B testing.

Fair enough, it would take a very large sample (122K is close enough) to
detect a change from 5% to 5.25%. But being concerned about a change that
small seems really silly unless 0.0025 * N visitors * revenue per conversion
is a big enough number to be concerned with. I contend it won't be unless
either

(1) N visitors is very large or

(2) revenue per user is very large.

If (1) is true, then testing on 122K users is not a big deal. If (2) is true,
you probably want a much more targeted approach, like having someone do
sales.

~~~
gjm11
Fourteen relative changes of 5% will double your conversion rate. Or halve it.
The cost of an A/B test is small enough that if it took, say, a sample size of
1000 then it would be well worth A/B testing changes that might make a 5%
relative difference. On the other hand, if it takes a sample of 122k then
indeed you might well decide not to bother -- e.g., because it might be
impossible. Which is why "it takes 122k rather than 1k to tell with any
confidence" is interesting.

(Rough numbers: suppose you get 1000 visitors per day and convert at 5%, and
suppose each conversion is worth $10 to you. Then you're bringing in about
$180k/year from them, and a relative change of 5% in that is about $9k. Seems
worth doing a modestly-sized A/B test for, but if it takes 4 months then you
might reasonably decide to spend your effort elsewhere. Or, of course, not:
the actual cost of doing the test is rather small. But a lot can change over 4
months.)
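
gjm11's rough numbers, spelled out as a quick back-of-envelope sketch:

    visitors_per_day = 1000
    conversion_rate = 0.05
    value_per_conversion = 10.0   # dollars

    annual = visitors_per_day * 365 * conversion_rate * value_per_conversion
    print(annual)         # 182,500 -- "about $180k/year"
    print(annual * 0.05)  # 9,125 -- the ~$9k from a 5% relative lift
    print(1.05 ** 14)     # ~1.98: fourteen 5% lifts roughly double the rate

    # Gathering 122k test subjects at this traffic takes about four months:
    print(122_000 / visitors_per_day)  # 122 days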

------
wccrawford
Every test produces results. Only 1 in 8 produces positive results.

Knowing that 2 different approaches are approximately the same is a great help
in design... It eliminates that 'Which way is better?' worry and lets you
design it the way that looks best, instead.

~~~
paraschopra
Yep, it is true that even insignificant results can help you concentrate on
how to make it look better. But insignificant results are a disappointment if
your goal is to increase conversion rate.

------
eljaco
They might also want to test a link to an FAQ or Details page. I was trying
to see if this was only for iPhone or Android, but there's really only one
option: give us your email.

~~~
noahkagan
Good point. It's mostly for web or client (Mac / Windows). Might do deals on
phone apps in the future. <http://appsumo.com/faq>

------
gohat
I'm not sure these examples are entirely failed tests -- rather, they teach
you what is working and what isn't.

~~~
noahkagan
That's really true. It's funny that some of the changes we weren't expecting
to make much of an impact have had the biggest results.

------
random42
Negative results are results too.

