
How we improved signups by 30% by doing nothing. - StavrosK
http://blog.historio.us/how-we-improved-signups-by-30-by-doing-nothin
======
jfarmer
Every time you do a significance calculation and decide whether to stop the
test or continue, you increase the likelihood of a type I error (i.e., a false
positive).

So, your 99.8% confidence isn't 99.8%.

There are a few ways to compensate for this. The easiest is to fix your sample
size and not reach any conclusions until you have tested that many users.

You can determine your sample size by, say, figuring out the minimum number of
people needed before there's a 95% chance you observe a 10% effect size.
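
As a rough sketch of that up-front calculation, using the standard
two-proportion approximation and a made-up 5% baseline conversion rate:

    # Sketch: sample size per group for a two-proportion z-test.
    # Assumed numbers: 5% baseline conversion, detect a 10% relative lift
    # (5% -> 5.5%) with alpha = 0.05 and 95% power.
    from scipy.stats import norm

    p1, p2 = 0.05, 0.055               # control rate, treatment rate (assumed)
    alpha, power = 0.05, 0.95

    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = norm.ppf(power)

    n_per_group = ((z_alpha + z_beta) ** 2 *
                   (p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2) ** 2

    print(round(n_per_group))          # ~52,000 visitors per arm with these numbers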

This is a problem in clinical trials where ethical questions arise. If the
test appears harmful, should we stop? If it appears beneficial, isn't it
unethical to deprive the control group of the treatment?

Anyhow, if you want to observe the results continuously, you need to use a
technique like alpha spending, sequential experimental design, or Bayesian
experimental design.

TL;DR: If you're periodically looking at your A/B testing results and deciding
whether to continue or stop, you're doing it wrong: your actual false-positive
rate is higher, and your real confidence lower, than the numbers you're seeing.
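
A quick way to convince yourself: simulate an A/A test (no real difference
between the arms) and "peek" at a z-test after every batch of visitors. All
the numbers here are made up; the inflation is the point:

    # Sketch: how repeated peeking inflates the false-positive rate.
    # Both arms have the SAME true conversion rate, so a single test at the
    # end should reject ~5% of the time; peeking after every batch rejects
    # far more often.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    p = 0.05                        # true conversion rate in BOTH arms (assumed)
    batches, batch_size = 50, 200
    z_crit = norm.ppf(1 - 0.05 / 2)

    def z_stat(conv_a, n_a, conv_b, n_b):
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        return (conv_a / n_a - conv_b / n_b) / se if se > 0 else 0.0

    runs = 2000
    peek_fp = end_fp = 0
    for _ in range(runs):
        ca = cb = na = nb = 0
        rejected_while_peeking = False
        for _ in range(batches):
            ca += rng.binomial(batch_size, p); na += batch_size
            cb += rng.binomial(batch_size, p); nb += batch_size
            if abs(z_stat(ca, na, cb, nb)) > z_crit:
                rejected_while_peeking = True
        peek_fp += rejected_while_peeking
        end_fp += abs(z_stat(ca, na, cb, nb)) > z_crit

    print("false positives, testing once at the end:", end_fp / runs)  # ~0.05
    print("false positives, peeking every batch:", peek_fp / runs)     # far above 0.05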

~~~
StavrosK
Thanks for that, that's something I haven't seen anywhere. Do you have any
references that explain why this error occurs?

~~~
carbocation
When you take multiple peeks at the data as it begins to accumulate, you're
essentially performing multiple testing.

Imagine that you wanted to see if A != B, with a 95% confidence. To
oversimplify, this means that you're willing to accept that 1 in 20 times,
you'll incorrectly reject the null and you will consider A different from B
even though they are truly the same.

If you run 20 independent tests at once, each at a 95% confidence, then by
chance you'd expect 1 to reject the null even if they're actually all null.

Now, if you repeatedly peek at the data as it accumulates for one test, you're
doing something similar (not 100% the same because the tests aren't totally
independent, but similar). To again oversimplify, you'd expect, by chance, to
see "significant" results 1 in 20 of the times that you look at the data, even
if there is no significant result. This is why you need to wait until you've
collected all of the necessary data first, or implement a procedure to protect
you from incorrectly seeing "significance" when it's not there. You might, for
example, require a more stringent confidence level earlier on, and less
stringent ones later.
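
You can check the 1-in-20 intuition directly, along with the crude "more
stringent threshold per look" fix (Bonferroni-style; proper sequential designs
use smarter boundaries):

    # Sketch: chance of at least one false positive across 20 tests at 95%
    # confidence, and a crude Bonferroni-style fix (alpha / number of looks).
    alpha, looks = 0.05, 20

    print(1 - (1 - alpha) ** looks)            # ~0.64, not 0.05

    alpha_per_look = alpha / looks             # stricter threshold at each look
    print(1 - (1 - alpha_per_look) ** looks)   # ~0.049, back near 0.05 overall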

~~~
StavrosK
Yep, that's what I realised from your and jfarmer's comments. It does make
sense that by peeking you're effectively sampling multiple times to see if
you've reached significance. Thank you.

------
ZeroMinx
I like this.

Not very many people have studied probability, whereas a lot of people
(relatively speaking) have heard about the benefits of A/B testing.

I think A/B testing is a very good method, but you need a lot of data. I'd
say, when you're just starting up, don't try A/B. You won't get the data you
need.

It's _very_ easy to be seduced by statistics, even when the stats are wrong.

EDIT: seduced, not deduced.

~~~
carbocation
It's very easy to use the wrong statistical method or to do things with your
data that you shouldn't, which causes you to misinterpret the output of
statistical methods.

Which is more likely: (a) that this was the 1-in-500 case (i.e., 1/(1 - 0.998))
where the correct method incorrectly rejected the null; (b) that there is
actually a cryptic difference between A and B in the A/B test; or (c) that the
wrong method was used, or the right method was misused, or something along
those lines?

I would say that those are arranged in increasing order of likelihood.

~~~
StavrosK
There wasn't much to go wrong... The A/B testing system renders one template
in two ways depending on whether the user is in the test or control group, and
the template was the same in both cases. If we _had_ made a mistake, the
results wouldn't have normalised when we got more data; they would have just
stayed where they were.

~~~
carbocation
I'm saying that the mistake was overinterpreting the statistics without having
a good handle on what they actually meant (part of option 'c' above). This
comes off as _ad hominem_, but it's the most common error mode, so I'd use
this as motivation to learn more stats and understand why the test was
probably not telling you what you thought it was.

------
carbocation
This post would be useful if it told us (1) the actual numbers of visitors vs.
successes at the timepoints you mention, and (2) the formula you used to
calculate confidence. Without knowing those two facts, it's hard to
conclude anything. Also, is that confidence interval corrected for the
multiple peeks that you took at the data? Etc. I'd like to know much more
about the particulars of the methodology here, because it sounds like
statistical methods might have been misunderstood or misused (perhaps by the
A/B software or anywhere else along the chain).

~~~
StavrosK
I can't share the actual data, but the confidence was calculated with the
standard formula for the confidence of a Z-score, as used in
<http://abtester.com/calculator/>, Google Website Optimizer, etc. The numbers
did check out; it was just an unlikely fluke, which is why I don't trust the
confidence interval as much as I used to...
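
Roughly, the kind of calculation those calculators do, as I understand the
generic two-proportion Z-score approach (the visitor counts below are made up):

    # Sketch of the kind of Z-score "confidence" such calculators report,
    # assuming a pooled two-proportion z-test (the visitor counts are made up).
    from math import sqrt
    from scipy.stats import norm

    n_a, conv_a = 1000, 50     # control: visitors, signups (made-up numbers)
    n_b, conv_b = 1000, 65     # variant: visitors, signups (made-up numbers)

    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se

    confidence = 2 * norm.cdf(abs(z)) - 1   # two-sided; some tools report one-sided norm.cdf(z)
    print(z, confidence)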

~~~
carbocation
Before running this test, did you estimate the sample size needed to detect
the expected difference in effects between A and B? Since the expected
difference here is zero, the necessary sample size would be infinite; at best,
you could power the test for a very small effect, which would still
necessitate a very large sample. I would be very cautious, then, about trying
to interpret the "confidence interval" before you had accumulated that large
sample size. If you're going to peek before getting to that point, then you
should almost certainly have implemented an alpha spending function (or
similar) in order to maintain the desired overall error rate.
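
For anyone unfamiliar with the term: an alpha spending function fixes, up
front, how much of the overall error budget you're allowed to have "spent" by
each interim look. A sketch of the two classic Lan-DeMets shapes, as I
understand the standard formulas (worth double-checking against a proper
reference):

    # Sketch: cumulative type I error "budget" allowed by information fraction t,
    # for the two classic Lan-DeMets spending functions (overall two-sided alpha = 0.05).
    # The O'Brien-Fleming-like function is stingy early; Pocock-like spends more evenly.
    import numpy as np
    from scipy.stats import norm

    alpha = 0.05
    t = np.array([0.2, 0.4, 0.6, 0.8, 1.0])  # fraction of the planned sample seen so far

    obrien_fleming = 2 - 2 * norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(t))
    pocock = alpha * np.log(1 + (np.e - 1) * t)

    print(obrien_fleming)  # almost nothing spent at early looks, the full 0.05 at t = 1
    print(pocock)          # spends the budget more evenly across looks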

This should not reduce your belief in confidence intervals; this is a great,
motivating opportunity that should prompt you to learn how to use them
correctly.

~~~
StavrosK
No, we hadn't estimated the necessary sample size. To be honest, nobody
expected the confidence metric to go so high for a test with no changes, which
is why we were surprised. We've studied various A/B testing resources and we
haven't seen this "peeking error" mentioned anywhere, so we'd appreciate any
resources you may have (and would possibly summarise them in a subsequent
post).

Thanks for the help!

~~~
carbocation
Sure thing. I just posted a bit more detail about this elsewhere in this
thread. Also, googling "alpha spending function" brings back some useful
results. (I can't paste into the HN text box with the latest Chrome dev,
otherwise I'd paste some links here for you.)

~~~
StavrosK
I saw that as well, thanks. This isn't the sort of statistics we did in uni,
so it's always good to get extra pointers. Thanks again!

------
mattmaroon
"The important lesson here, and the one you should take away from our
experience, is this: Whenever you think you have enough data for the A/B test,
get more! Sometimes, you will fall into that 0.1%, and your decision will be
wrong, and might impact your metrics adversely, and you might never find out."

This is actually terrible advice because continuing a test in which one set is
significantly better than another has a cost. You are showing an inferior set
to a segment of your users and that costs you money (or signups or whatever
metric it is you're improving which, at the end of the day, presumably equates
to money).

As an example, suppose you do a test and discover something that doubles your
signup rate (and therefore monetization rate) and you've got a confidence
level of 99.9%. It's true, there's a 1 in 1,000 chance your result is flawed
and you'll end up making the wrong decision. But there's a 999 in 1,000 chance
that you're showing a significantly inferior signup page to half of your
customers, costing you about 25% of potential revenue. It doesn't even take
someone who knows what EV stands for to realize that the EV of ending the test
is huge here.
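
To make the arithmetic explicit (these are the hypothetical numbers from the
example above, with an assumed 5% base signup rate, and "wrong" taken to mean
B is really no better than A):

    # Sketch of the expected-value arithmetic: B doubles the signup rate and the
    # test reports 99.9% confidence.
    base_rate = 0.05   # assumed signup rate for page A
    lift = 2.0         # B converts at twice the rate
    p_wrong = 0.001    # 1-in-1,000 chance the result is a fluke

    rate_ship_b = lift * base_rate                      # everyone sees B
    rate_keep_testing = (base_rate + rate_ship_b) / 2   # half of traffic still sees A

    print(1 - rate_keep_testing / rate_ship_b)          # 0.25 of potential signups lost

    ev_stop = (1 - p_wrong) * rate_ship_b + p_wrong * base_rate
    print(ev_stop, rate_keep_testing)                   # 0.09995 vs 0.075: stopping wins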

~~~
StavrosK
I guess that depends on long-term vs. short-term gain. Sure, you'll gain a few
days' worth of signups, but there's a small chance you'll make a decision that
negatively impacts every visitor from that point on (and that you'll never
detect it).

------
Estragon
"Why Most Published Research Findings Are False"
[http://www.plosmedicine.org/article/info:doi/10.1371/journal...](http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124)

~~~
carbocation
And the specific reason here (among the ones listed in that article) is almost
certainly a lack of data. It's always helpful to estimate your necessary
sample size prior to beginning your test, so you don't get too excited when
results look odd before you get anywhere near the sample necessary to see an
effect of a given size.

~~~
StavrosK
Sure, but the way you estimate your sample size is working backwards from the
confidence level you want to get. In this case, we were waiting for the
confidence metric to tell us when the sample size was appropriate, but it
didn't work out as planned.

~~~
jfarmer
No, you're doing it wrong.

You say, "How many people should I test before there's a 95% chance I observe
a 10% effect size?"

That's your sample size. It's easy to compute up front.
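
That question maps directly onto an off-the-shelf power calculation; for
example, something like this with statsmodels (the 5% baseline rate is a
made-up number):

    # Sketch: up-front sample size for a 10% relative lift (5% -> 5.5%, made-up
    # baseline) with 95% power at alpha = 0.05, using statsmodels.
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    effect = proportion_effectsize(0.055, 0.05)   # Cohen's h for the assumed rates
    n_per_group = NormalIndPower().solve_power(effect_size=effect,
                                               alpha=0.05,
                                               power=0.95,
                                               alternative='two-sided')
    print(round(n_per_group))                     # on the order of 50,000 per arm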

~~~
StavrosK
Hmm, yes, I see now why you're right. I'd like to read a bit more on the
theory, though, so we don't make these mistakes again. I'll look around for
some resources, thanks again!

~~~
carbocation
Terms to get you started are "power analysis" and "sample size estimation."

------
huhtenberg
Perhaps one page was loading faster than the other?

~~~
ZeroMinx
Perhaps.

But if I flip a coin 10 times and get the queen's face (I'm in the UK) 8 of
those times, that doesn't mean I'll keep getting heads.

It's been quite a few years since I was involved in the betting world, but at
the time I was constantly amazed at the number of people who thought the
Martingale system was a winner.

~~~
eru
What does the Martingale system have to do with believing that you will keep
getting what you got?

------
desigooner
Any relation between the increased number of signups and the Xmarks
announcement?

I know I almost signed up, but I'm still tossing up between Pinboard and
historious.

~~~
StavrosK
No, this was about a month ago. You can give historious a go, as the account
is free (as opposed to pinboard), so it's no-risk!

