

Ask HN: Small A/B tests impractical? - bemmu

I was playing around with numbers to see how many visitors I need to get a statistically significant A/B test result. I used to think I could solve any small issue in my app by just A/B testing which way is better (patio11 influence), but now it seems that I can't run any tests in a realistic timeframe unless they have more than 2.5 percentage points of effect.

If you have a landing page with a 10% conversion rate and want to see if your change improved it to 11%, it seems you need 8000 people in group A and 8000 people in group B. I'm getting about 400 new people daily to try it with, so it becomes impractical to do small tests like these. If I can improve conversion from 10% to 12.5%, then I only need 2600 people, which would mean I could run one test per week.

Am I kidding myself if I act on tests that aren't statistically significant? Even if the difference between two options isn't significant, it would still seem the odds are in your favor if you pick the one that performed better.

I didn't do the math myself, just used a random calculator I found: http://www.prconline.com/education/tools/statsignificance/index.asp
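
For reference, the standard normal-approximation formula behind calculators like this one can be sketched in a few lines of Python. The exact answer shifts with the confidence level and statistical power you assume, which is presumably why different calculators disagree; at the conventional 80% power, the 10% -> 11% test actually comes out well above 8000 per group:

    from math import ceil, sqrt
    from statistics import NormalDist

    def n_per_group(p1, p2, alpha=0.05, power=0.80):
        # Per-group sample size for a two-sided two-proportion
        # z-test (normal approximation).
        z_a = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
        z_b = NormalDist().inv_cdf(power)          # 0.84 for 80% power
        p_bar = (p1 + p2) / 2
        num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
               + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
        return ceil(num / (p1 - p2) ** 2)

    print(n_per_group(0.10, 0.11))   # ~14751 per group at 80% power
    print(n_per_group(0.10, 0.125))  # ~2507 per group at 80% power

The linked calculator presumably assumes less power than 80%, so take any single calculator's number with a grain of salt.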
======
patio11
Ooh, I think you gave me a good idea for a blog topic.

1) You can run multiple A/B tests at the same time without compromising
results, if you assume they are independent. Your stats teachers at college
will be mad at me for this, but "That is a safe assumption 99% of the time."
(Shh!) See the sketch at the end of this comment for one way to do the
bucketing.

2) In terms of prioritizing your time, I wouldn't worry about things you think
are likely to be marginal _unless_ they're in key areas of your application
where a 1% marginal lift would matter to the business. For example: 1%
improvement in observed results of the font selection dialog box: worthless to
the business. 1% improvement in the shopping cart: heck yes worth it to the
business.

3) We often underestimate how effective certain changes are going to be. If
you throw up a test and get statistically insignificant results, that doesn't
cost you anything, and you still learn something: oh well, you learned that
users do not, empirically, find the difference between A and B supremely
compelling. If, on the other hand, you say "Eh, this is likely to only be a 1%
factor" and it turns out to be a 10% factor ("BUTTON TEXT?! WHO CARES ABOUT
FREAKING BUTTON TEXT?! TEN FREAKING PERCENT?!"), you'll get statistically
significant results _and_ learn very valuable things.
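
To make point 1 concrete: one common way to get independent assignment across concurrent experiments (a sketch of one scheme, not necessarily how any particular tool does it) is to bucket each user by hashing the user id together with a per-experiment name:

    import hashlib

    def variant(user_id, experiment, variants=("A", "B")):
        # Hash the (experiment, user) pair so each experiment
        # buckets users independently of every other experiment.
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        return variants[int(digest, 16) % len(variants)]

    # The same user can land in different buckets in different tests:
    print(variant("user-123", "button-text"))      # e.g. "A"
    print(variant("user-123", "checkout-layout"))  # e.g. "B"

Because the hash for one experiment name says nothing about the hash for another, a user's bucket in one test is uncorrelated with their bucket in any other, which is exactly the independence assumption above.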

~~~
bemmu
I'd be honored if I actually managed to inspire the great patio11 to write a
new post :) And yes, I did learn that the thing I was testing doesn't have a
very large effect, which was valuable to know.

