

3x Improvement: Great sign-up page split-test for a web-app - richardburton
http://www.abtests.com/test/35001/homepage-for-ebox-platform
Great A/B split-test posted by @bokardo
======
bokardo
Hi, I'm one of the creators of <http://www.abtests.com>. The issue of
statistical significance has come up over and over, so I'll try to explain our
view of it.

We ask people to input their raw data...both trials and conversions. If they
do this honestly (anybody can fake data about anything) then in our view the
results speak for themselves. We've had folks upload data that was obviously
not statistically significant, and we've had people write blog posts
denouncing those results. We've also had folks upload test data that was
statistically significant, and people have said they're learning a lot from it.

So we've had both solid and suspect data uploaded to the site with good
discussion around it. This is exactly what we hoped for...I think in the
future as more tests get uploaded the wheat will be separated from the chaff,
so to speak, and those tests with significant data will get lots more
attention than those that don't. In fact, we're already seeing this in the
traffic logs.

And, as several folks have mentioned, many tools do the hard stats math for
you, telling you when your data is statistically significant. This helps
people know when they can be confident in sharing their data with others.

~~~
paraschopra
Doing the math here: A/B tests with conversions are modeled as binomial
variables, so the standard error of a conversion rate is sqrt(p*(1-p)/n),
where p is the conversion rate and n is the number of hits (p*(1-p) being the
variance of a single Bernoulli trial). Calculating the standard error for both
of your versions: sqrt(0.002*(1-0.002)/2834) = 0.0008 for one, and 0.0017 for
the other. Since the number of trials is large, you can model the difference
of the two binomial proportions as a normal distribution whose standard
deviation is sqrt(se_1^2 + se_2^2) = 0.0019.

Significance is then checked with a one-tailed z score (we are testing whether
the difference between the two rates is statistically significant and greater
than zero). The z score here is (p_1 - p_2)/sd, that is
(0.008 - 0.002)/0.0019 = 3.16, which is well above the critical value of 1.65
(corresponding to 95% confidence).

So, the difference is indeed statistically significant. One note of caution:
the usual rule of thumb says you shouldn't approximate a binomial with a
normal distribution until you have at least 10 successes and 10 failures, and
with only about 6 conversions on the control side that threshold isn't quite
met here.
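
If it helps, here's the same calculation as a small Python sketch (using the
0.002/0.008 rates and 2834/2836 trial counts from above; scipy is only needed
for the normal CDF):

    import math
    from scipy.stats import norm

    # Conversion rates and trial counts quoted above
    p_a, n_a = 0.002, 2834   # control
    p_b, n_b = 0.008, 2836   # challenger

    # Standard error of each estimated rate: sqrt(p*(1-p)/n)
    se_a = math.sqrt(p_a * (1 - p_a) / n_a)
    se_b = math.sqrt(p_b * (1 - p_b) / n_b)

    # The difference of the two estimates is approximately normal
    se_diff = math.sqrt(se_a**2 + se_b**2)

    # One-tailed z test: is B's rate greater than A's?
    z = (p_b - p_a) / se_diff
    p_value = norm.sf(z)

    print(f"z = {z:.2f}, one-tailed p-value = {p_value:.5f}")
    # z comes out a little above 3, comfortably past the 1.65 critical value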

~~~
defen
See my reply lower in the thread - I worked out the numbers using Bayesian
inference to find the exact probability that B is better than A, subject to a
few assumptions. The benefit of this approach is that it's exact, so you don't
need a minimum number of samples for a normal approximation to hold. The
answer is that B is almost certainly better than A. Here's the calculation I
plugged into Wolfram Alpha:

2835 2837 choose[2834,6] choose[2836,24] NIntegrate[(f^6) (1-f)^2828 (g^24)
(1-g)^2812,{f,0,1},{g,f,1}]
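
For anyone who'd rather not lean on Wolfram Alpha, the same quantity can be
computed with a known closed-form sum for P(p_B > p_A) between two independent
Beta posteriors (valid when B's alpha parameter is an integer). A sketch under
the same assumptions as above - 6/2834 conversions for A, 24/2836 for B,
uniform priors:

    import math
    from scipy.special import betaln

    def prob_b_beats_a(alpha_a, beta_a, alpha_b, beta_b):
        """Exact P(p_B > p_A) for independent Beta posteriors (integer alpha_b)."""
        total = 0.0
        for i in range(alpha_b):
            total += math.exp(
                betaln(alpha_a + i, beta_a + beta_b)
                - math.log(beta_b + i)
                - betaln(1 + i, beta_b)
                - betaln(alpha_a, beta_a)
            )
        return total

    # Uniform priors give posteriors Beta(conversions + 1, failures + 1)
    print(prob_b_beats_a(7, 2829, 25, 2813))   # comes out around 0.9996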

------
mcantor
I'm extremely rusty on my statistics, and I upvoted this because I find A/B
tests interesting, but... are these numbers statistically significant? For the
population of internet users, are they actually practically significant? It
just seems like the sample sizes and differences aren't really big enough to
draw solid conclusions from. It doesn't say how long each test lasted for,
either--what if the second test was done during peak hours?
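
For what it's worth, the usual back-of-the-envelope sample-size formula for
comparing two proportions gives a feel for "big enough". This is just a
sketch; the 0.2% -> 0.8% rates are the ones discussed elsewhere in the thread,
not something I know about this particular test:

    from scipy.stats import norm

    def trials_per_variation(p_base, p_target, alpha=0.05, power=0.8):
        """Approximate trials needed per variation to detect a lift from
        p_base to p_target with a one-tailed z test."""
        z_alpha = norm.ppf(1 - alpha)   # one-tailed critical value
        z_beta = norm.ppf(power)
        variance = p_base * (1 - p_base) + p_target * (1 - p_target)
        return (z_alpha + z_beta) ** 2 * variance / (p_base - p_target) ** 2

    print(round(trials_per_variation(0.002, 0.008)))  # roughly 1,700 per variation
    # Smaller lifts need far more traffic: 0.2% -> 0.3% needs roughly 30,000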

~~~
TimothyFitz
This is exactly why I _don't_ read other people's A/B experiment results. I
haven't seen a single A/B experiment that listed its statistical significance
(and how that was calculated). I fear that bad data is worse than no data at
all.

~~~
brown9-2
Google Website Optimizer will give you a plus/minus range for your estimated
conversion rate, which at least lets you estimate the significance/confidence.

For example, one of the pages in one of my current tests shows up as "Est.
Conversion Rate: 17.0% +/- 1.4%".
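
For reference, here's roughly how a plus/minus figure like that can be
computed with a plain normal-approximation interval. Whether GWO uses exactly
this method (or this confidence level) is an assumption on my part, and the
counts below are made up to resemble the example:

    from math import sqrt
    from scipy.stats import norm

    def conversion_interval(conversions, trials, confidence=0.90):
        """Normal-approximation confidence interval for a conversion rate."""
        p = conversions / trials
        z = norm.ppf(0.5 + confidence / 2)   # two-sided critical value
        margin = z * sqrt(p * (1 - p) / trials)
        return p, margin

    # Hypothetical counts shaped like the readout above
    p, margin = conversion_interval(340, 2000)
    print(f"Est. Conversion Rate: {p:.1%} +/- {margin:.1%}")   # 17.0% +/- 1.4%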

------
rarestblog
2 vs 6 orders doesn't seem like enough. There are people in affiliate programs
who change nothing, yet get 20 orders one day and 0 the next. It looks like
something has broken, but it's back to normal the next day - just a
fluctuation in the stats. I really wouldn't consider 2 vs 6 orders a
significant sample.

------
3pt14159
Even if this has a G-test significance of 99.86%, that doesn't mean it's
valid. You need AT LEAST 10 results for the formulas to work correctly. Also,
claiming a 300% improvement in conversion is madness. I'm not even fully
convinced, statistically, that the challenger is better at all, let alone
certain that it's 300% better.
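
For reference, the G-test mentioned here is a likelihood-ratio test on the 2x2
table of conversions vs. non-conversions. A minimal sketch (I'm assuming the
6/2834 vs 24/2836 counts discussed elsewhere in the thread; the comment above
doesn't say which counts produced the 99.86% figure):

    import numpy as np
    from scipy.stats import chi2

    def g_test_2x2(conv_a, n_a, conv_b, n_b):
        """Likelihood-ratio (G) test on a 2x2 conversions/non-conversions table.
        Assumes every cell is non-zero."""
        observed = np.array([[conv_a, n_a - conv_a],
                             [conv_b, n_b - conv_b]], dtype=float)
        expected = (observed.sum(axis=1, keepdims=True)
                    * observed.sum(axis=0, keepdims=True)) / observed.sum()
        g = 2.0 * np.sum(observed * np.log(observed / expected))
        return g, chi2.sf(g, df=1)

    g, p = g_test_2x2(6, 2834, 24, 2836)
    print(f"G = {g:.2f}, p = {p:.5f}")   # around G = 11.6, p = 0.0007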

~~~
defen
If you model the two alternatives as Bernoulli processes with unknown success
rates, assume that the only difference between them is what is specified on
the page and that they don't interact, and put a uniform prior on both
parameters, then B's conversion rate is higher than A's with probability
0.999572.
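
A quick way to sanity-check a figure like that is to sample from the two Beta
posteriors directly. A sketch under the same assumptions (6/2834 and 24/2836
conversions, uniform priors):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000

    # Posteriors under uniform priors: Beta(conversions + 1, failures + 1)
    p_a = rng.beta(6 + 1, 2834 - 6 + 1, n)     # A: 6 of 2834
    p_b = rng.beta(24 + 1, 2836 - 24 + 1, n)   # B: 24 of 2836

    print((p_b > p_a).mean())   # lands near 0.9996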

