Right, but unless you run a gigantic site, can you really test stupid little things like button designs, when you need thousands or even tens of thousands of conversions to have enough data for a meaningful result?
I feel a bit burned, because the last test I ran took me a week to set up. I wanted to see if having extra display ads would affect the rate at which visitors convert into repeat visitors. I put new users into groups, and after two days I had 800 people in both groups. Then later on, with data in hand, I realized that the change wasn't large enough / I didn't have enough people in my test to get a significant result.
You spent a week setting up the test, then ran it for 2 days and were disappointed that you didn't have sufficient data?
If you've set up A/B testing right, there is no problem letting tests run as long as they need to. You spent a week setting it up, so you can let it run for a month. That would let you detect differences about a quarter of the size of the ones you could call significant with your 2 days of data.
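For the curious, here is a rough sketch of why running longer buys you roughly that factor of four. The 10% baseline and 500 visitors per arm per day are placeholders loosely inferred from the figures mentioned later in this thread, not real site numbers:

    import math

    def min_detectable_diff(n_per_arm, base_rate=0.10, z=1.96):
        # Normal-approximation standard error for the difference between two
        # proportions near the baseline rate, scaled to ~95% confidence.
        se = math.sqrt(2 * base_rate * (1 - base_rate) / n_per_arm)
        return z * se

    visitors_per_arm_per_day = 500  # placeholder: ~2000 users over 2 days, split two ways
    for days in (2, 30):
        n = visitors_per_arm_per_day * days
        print(f"{days:2d} days, n={n:5d}/arm -> smallest detectable diff ~ {min_detectable_diff(n):.2%}")
    # 2 days:  ~2.6 percentage points
    # 30 days: ~0.7 percentage points, about 1/4 the size (sqrt(2/30) ~ 0.26)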
And besides, if you need a week to set up an A/B test then you don't have a good A/B testing framework for your site. Solve that problem, then A/B tests should be easy to do.
Okay, I followed the instructions in that post, and unless I made a mistake you cannot show an improvement from a 10% -> 11% conversion rate if you only have 2000 users to play with.
Let's say that from 1000 visitors I have learned that my conversion rate is 10%. Standard error = sqrt(0.1 * 0.9 / 1000) ≈ 0.0095. So with 95% confidence the conversion rate is actually between 8.14% and 11.86%. I run an experiment on 1000 visitors and seemingly have an improved conversion rate of 11%. But with 95% confidence it is actually between 9.06% and 12.94%. That heavily overlaps the previous result, so the result isn't significant.
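For what it's worth, here is the same arithmetic in a few lines of Python (same numbers as above, nothing site-specific), plus the more direct check of testing the difference itself:

    import math

    def ci_95(conversions, visitors):
        # Normal-approximation 95% confidence interval for a conversion rate.
        p = conversions / visitors
        se = math.sqrt(p * (1 - p) / visitors)
        return p - 1.96 * se, p + 1.96 * se

    print(ci_95(100, 1000))  # control: 10% -> roughly (8.14%, 11.86%)
    print(ci_95(110, 1000))  # variant: 11% -> roughly (9.06%, 12.94%)

    # Checking whether the two intervals overlap is a conservative criterion;
    # the usual check is a two-proportion z-test on the difference itself, and it agrees:
    p_pool = (100 + 110) / 2000
    se_diff = math.sqrt(p_pool * (1 - p_pool) * (1 / 1000 + 1 / 1000))
    print((0.11 - 0.10) / se_diff)  # z ~ 0.73, well short of the ~1.96 needed

Either way, 1000 users per arm is nowhere near enough to call a 10% -> 11% change real.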
My takeaway from this is that it isn't worthwhile to do very small tests, but it is possible to get reliable results when doing fewer, bigger changes.
> My takeaway from this is that it isn't worthwhile to do very small tests, but it is possible to get reliable results when doing fewer, bigger changes.
That is exactly wrong. You don't want to run a few big tests.
Instead, run multiple minor tests at the same time. Make your tests small changes - an email subject line, the size of a button, the help text you're using, etc. - and test them in parallel so you can answer questions more quickly.
This strategy is not perfect, because there may be interaction effects between your different tests. With larger traffic volumes you can address that problem; with smaller ones you have to hope that the interactions don't matter much. Usually they don't, so you'll get away with the simple strategy.
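One common way to run several small tests in parallel is to bucket each user independently per experiment, for example by hashing the user id together with the experiment name. This is only a sketch of the idea, not any particular framework's API:

    import hashlib

    def variant(user_id, experiment, alternatives):
        # Deterministic assignment per (user, experiment) pair: the same user
        # always sees the same variant of a given test, and assignments for
        # different experiments are effectively independent of each other.
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        return alternatives[int(digest, 16) % len(alternatives)]

    uid = "user-12345"  # hypothetical id; several small tests keyed off it at once
    print(variant(uid, "button-size", ["small", "large"]))
    print(variant(uid, "subject-line", ["Welcome!", "Your account is ready"]))
    print(variant(uid, "help-text", ["short", "verbose"]))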
What sorts of things should you test? Here are some ideas: button size, page layout, help text, adding or removing form questions, email text, email subject lines, landing pages, registration pages, and so on. Interestingly, the overall aesthetics of your site can be tested but seldom show a result. You tend to get payoffs from things that directly affect user interactions at points where users are not very motivated to continue with your site.
If you're going to be at the SXSW conference next month, the CEO of Freshbooks has a panel on A/B testing on Sunday that will give you some great concrete ideas of what has worked for a variety of companies. I highly recommend it. (Disclaimer, I'm on said panel.)
I think what he is saying is that running minor tests on extremely small traffic isn't going to give significant results any time soon, and I would agree with him.
Though I agree with you that doing minor tests is a better strategy, provided you have enough traffic for it.
I say that because the biggest bang for the buck that I've seen from A/B tests has tended to come from really small changes. Take out an unneeded form element. Take out the verbose explanations of questions so the form looked less intimidating. Choose an email subject line that attracted the right group of people for that email.
Most developers would think of those as minor changes. None took more than a couple of hours to code. But I've seen those changes result in 10-40% lifts to the bottom line. That's not a minor change to the business as a whole.
More importantly, the figures that were quoted suggest about 800 visitors/day. The patience to let tests run 2-4 weeks can let you measure that level of improvement. And after you have a few wins of that size, you start a virtuous cycle where every improvement makes it easier to raise traffic volume, which makes it easier to run A/B tests, which makes it easier to find further improvements.
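A rough back-of-the-envelope version of that claim, using my own assumed numbers (roughly 800 visitors a day split across two arms, with a 10% baseline conversion rate):

    import math

    # With ~400 visitors per arm per day and a 10% baseline conversion rate,
    # roughly what relative lift can a 4-week test resolve at ~95% confidence?
    base_rate = 0.10
    n_per_arm = 400 * 28                                       # ~11,200 per arm
    se_diff = math.sqrt(2 * base_rate * (1 - base_rate) / n_per_arm)
    print(1.96 * se_diff / base_rate)  # ~0.08, i.e. roughly an 8% relative lift

That is in the same ballpark as the lifts mentioned above, which is why a few weeks of patience is enough.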
> My takeaway from this is that it isn't worthwhile to do very small tests, but it is possible to get reliable results when doing fewer, bigger changes.
Yup, that is right. If you have less traffic, testing bigger, more drastic changes is much more likely to give you statistically significant differences (either positive or negative). In fact, even the bigger sites can choose to include just a fraction of their traffic (say 2%) and then test dramatically different designs/callouts/etc.
I apologize in advance if any of this is repetitive with things I have said before:
1) There is a difference between what you and I think are really stupid little things and what reality thinks are really stupid little things. For example, I think it is likely that you can (with a suitable A/B testing framework) change a call to action on your site in under five minutes of coding, counting the time to redeploy. Many engineers would consider that a stupid little change. Empirically, exactly that test has resulted in double-digit improvements before. So totally do that. (There's a toy sketch of what such a test can look like at the end of this list.)
2) An A/B test which leads to the result "Not enough data to tell" doesn't convey zero information. It conveys an important bit of information: "Well, neither of these two alternatives lit a fire in the hearts of my users." That should give you the ability to recalibrate your efforts in the future.
3) With the context that I run a very part-time business: for the last couple of years I aimed at doing four A/B tests a month and getting a 5% increase out of one of them. If the other three end without statistically significant results, oh well. (See http://www.bingocardcreator.com/abingo/results -- one in four isn't too far off what I actually get. Nota bene that is a convenience sample of results I've actually accumulated, not all of them.) Of more applicability to you: you have to accept that A/B testing is a process which converges on awesome rather than a button which is onclick: deliverTheAwesome(). Over time, I promise you, it really does work. (It almost can't not work.) However, individual tests will frequently return a null result or tell you that the new code you just spent time writing is a waste.
4) I try to avoid doing tests that take a week to set up unless I have a good reason to suspect they are going to seriously move the needle, because for the same amount of work you can throw a lot of smaller things at the wall and see what sticks. For example, I've put off an A/B test that I absolutely have to do until I am gainfully unemployed just because of the implementation difficulty. (A: current site selling online and downloadable software. B: what are you talking about, there is only an online version.) Especially if you're new to A/B testing and don't have a good example that you can point to yourself and say "Bemmu, you feel like quitting, but REMEMBER THIS?! Oh good golly that worked out right. Test on!", test some stuff which takes 10 minutes to bang out alternatives for. Conversion buttons, calls to action, positioning of buttons, that sort of thing.
5) "Things that really move the needle" and "things that take a lot of time to implement" do not relate to each other. In fact, they sometimes have so little relation it is hilarious. Ask me how many weeks I have burned in developing features no one cares about versus how many minutes it took to change five words on my purchasing page. ("Buy a single copy via ..." -> "Get instant access to ...") Sure, go ahead and test the button designs. Let reality tell you what matters rather than thinking you have a good idea. I haven't the furthest clue.