How We Improved Our Conversion Rate by 72% (dmix.ca)
132 points by dmix on May 30, 2010 | 30 comments


A bit of professional curiosity: I note the site appears to be written in Rails. What did you use for the A/B testing? [Edit: I see below that the answer is A/Bingo. I approve, for the obvious reason.] Any opinions on it?

You probably already know this, but the statistical magic behind A/B testing lets you know if A was better than B but not by how much. You can calculate "Hmm, 60% more conversions" but that conclusion is written on water, as I discover with disturbing heaven-smites-you-for-arrogance regularity every time I mention the results of the test that way.

If you sustain the 70% higher level, though, hats off and keep spreading the testing gospel!


I use A/Bingo [1] on CareLogger, although at work (Learnhub.com) we use Vanity [2].

I found that Vanity splits the participants more equally. I noticed that with A/Bingo one alternative would have 50 or so more trials than the others. Not a big deal unless you check the dashboard constantly.

They both do a simple task well so either will work fine.

Regarding the sustainability, this is something I've noticed as well. The conversion rate fluctuates heavily depending on the day of the week (the middle of the week is best). It swings back and forth, but 25% is the new middle ground, not just the good days as it was before.

[1] http://www.bingocardcreator.com/abingo/

[2] http://vanity.labnotes.org/ab_testing.html


Both A/Bingo and Vanity split participants in essentially the same fashion: each new participant is assigned totally randomly. (Not only is it the same effect, the algorithm we use for it is practically identical, too.)

This tends to produce a phenomenon well-known to coin flippers: the more coins you flip, the closer the percentage of heads and tails will converge to 50/50 and the farther your counts of heads and tails will diverge from each other.
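To make the coin-flip point concrete, here's a small simulation sketch in plain Python (not code from either library; split_stats is just a made-up helper):

  import random

  # Assign n participants to A or B with a fair coin flip, then report the
  # percentage that landed on A and the absolute gap between the two counts.
  # The percentage converges toward 50%, but the gap tends to grow (roughly
  # like the square root of n).
  def split_stats(n):
      a = sum(1 for _ in range(n) if random.random() < 0.5)
      b = n - a
      return 100.0 * a / n, abs(a - b)

  for n in (100, 10_000, 1_000_000):
      pct_a, gap = split_stats(n)
      print(f"n={n}: A got {pct_a:.2f}% of participants, |A - B| = {gap}")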


I figured as much. It's likely that I noticed it because the environment where I use Vanity gets much more traffic than CareLogger, so the numbers even out much more quickly.


I really like articles like these, because they give concrete specifics rather than just general advice like "be determined, never give up", etc.


Do a barrel roll


I wonder if the red/green difference is partly due to blues and greens for sign-up buttons becoming common, so red stands out. Could it just be an example of staying with / ahead of the curve?


Yep, as I mentioned in the article it worked so well primarily because green was used on multiple parts of the homepage. So the red was a significant contrast to the rest of the layout.

It's like 7UP's logo: they put a red dot on it so it draws your eye when you're scanning a row of cans.


What would be interesting would be the actual figures, not just percentages.

Maybe you had 7 clicks before and now you have 12. Or maybe you had 14 clicks and now you have 24. The actual figures would help us judge the significance of the result.


I did share my signup conversion rate at the beginning of the article; you can apply that to the trial numbers to get an idea.

Here's an example I grabbed from A/Bingo dashboard for the 2 different headlines test:

Version 1: 672 participants - 96 (14.42%) conversions

Version 2: 683 participants - 129 (18.97%) conversions


It might not be a big deal but the difference isn't significant at 95% confidence level. It is significant at 90% confidence level but I personally prefer to shoot for >95% and ideally 99% confidence level.
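For anyone who wants to plug the headline-test numbers above into a significance check themselves, here's a minimal sketch of a pooled two-proportion z-test in plain Python (a textbook calculation, not necessarily the exact math the A/Bingo dashboard runs; tools differ on one- vs. two-tailed tests and corrections):

  from math import erfc, sqrt

  # Pooled two-proportion z-test on the two headline variants quoted above.
  def z_test(conv_a, n_a, conv_b, n_b):
      p_a, p_b = conv_a / n_a, conv_b / n_b
      pooled = (conv_a + conv_b) / (n_a + n_b)
      se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
      z = (p_b - p_a) / se
      p_two_sided = erfc(abs(z) / sqrt(2))   # both tails of the standard normal
      return z, p_two_sided

  z, p = z_test(96, 672, 129, 683)
  print(f"z = {z:.2f}, two-sided p = {p:.3f}")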


Out of curiosity, why?

To me, there is a difference between theoretical significance and practical significance. A 90% likelihood that the second version is actually better than the first version is enough for me to switch.

What is the downside of switching? About 10% of the time you'll be making a change that is no better than the old version. Unless you REALLY love green buttons, I think it's worth the risk. :)


There isn't any downside. But if the test costs are negligible and you can afford to run the test a week longer, it is always worth doing so. I have seen too many tests where the confidence level, after touching >95%, came back down to 70% or so once the test was extended.

An even better check is a follow-up A/A test where both variations are red. If that test still shows a meaningful difference between identical variations, then I don't think you should take the original results seriously.

When you are testing it is always better to try proving a hypothesis wrong rather than trying to prove it right.

EDIT: clarified some parts.


Perhaps multi-armed bandit algorithms can help here. They automatically balance testing which version is better against serving the best version as much as possible.

In the multi-armed bandit problem, a gambler has a number of levers at his disposal. He chooses which lever to pull and then receives a reward. In this case lever 1 is "show page version A" and lever 2 is "show page version B". The algorithms balance discovering which lever is best against pulling the best lever as often as possible.

Here's an example of a very simple algorithm. Record the average profit for page A and page B in two variables. Now with probability p (for example p=95%) choose the page with the highest average profit so far. With probability 1-p pick one at random. A more advanced algorithm could vary p over time so that it starts at 0% and increases towards 100%.
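A minimal sketch of that simple algorithm in Python, where reward() is a hypothetical callback returning 1 if the visitor converted and 0 otherwise:

  import random

  # Serve the best-performing version with probability p, otherwise pick one
  # at random. reward(version) is a hypothetical callback: 1 on conversion.
  def run_bandit(reward, visitors=10_000, p=0.95):
      shown = {"A": 0, "B": 0}
      converted = {"A": 0, "B": 0}
      for _ in range(visitors):
          if random.random() < p and any(shown.values()):
              # exploit: version with the highest average conversion so far
              version = max(shown, key=lambda v: converted[v] / shown[v] if shown[v] else 0.0)
          else:
              # explore: pick a version at random
              version = random.choice(["A", "B"])
          shown[version] += 1
          converted[version] += reward(version)
      return shown, converted

  # Example usage against simulated visitors with hidden "true" conversion rates:
  true_rates = {"A": 0.14, "B": 0.19}
  shown, converted = run_bandit(lambda v: int(random.random() < true_rates[v]))
  print(shown, converted)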

http://en.wikipedia.org/wiki/Multi-armed_bandit


I suspect that it's because 95% and 99% are familiar numbers from a statistics class, corresponding roughly to 2 and 2.6 standard deviations in a normal pdf respectively (3 standard deviations is about 99.7%).
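For reference, the two-sided z thresholds are easy to check with the Python standard library (a quick sketch, unrelated to any particular testing tool):

  from statistics import NormalDist

  # Two-sided z thresholds for common confidence levels under a standard normal:
  # 90% -> ~1.64 sigma, 95% -> ~1.96 sigma, 99% -> ~2.58 sigma (3 sigma ~ 99.7%).
  for conf in (0.90, 0.95, 0.99):
      z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
      print(f"{conf:.0%} confidence ~ {z:.2f} standard deviations")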

In statistical tests, most answers have something to do with standard deviations of the normal distribution, whether it's what counts as a 'significant' result, the error bars on a histogram, or the choice of error range on a maximum likelihood fit that has no obvious connection to a normal distribution. (All of these are prevalent in the high energy physics community.)

Statistics are very often used to support 'gut' instincts like that without necessarily understanding the underlying meaning of the mathematics. Happily, it is often the case that approximating everything to a normal and using 'sigmas' is enough to get by.

(May not be the case here, but I deal with it every day at work... </rant>)


Oh, sorry, I didn't see that there.


Out of curiosity, was the change to "get started now" based on my comment here (http://news.ycombinator.com/item?id=1380017)? I've been lobbying for more people to try that language and to share their results. Thanks for doing that, even if it was independent of my own stuff.


I did come across a discussion on HN recently about "Get Started" that inspired me to do it.

Although I checked, and that particular comment was from 4 days ago while I ran the test before that, so maybe it was one of your earlier comments?


The "signup for free" part had me wondering. "Sign up" is the verb, where "signup" is a noun. I wonder if the benefit was just in eliminating the misused word. Most other A/B tests I've seen favor the phrase with "free" in it.

It's similar to the "login" vs. "log in" discussion, but I think it's a bit more clear cut with "sign up."


Personally, grammatical mistakes like this have always bothered me, but most people don't even realize it's wrong. So it's odd to think that this would have an effect on the populace at large.

Perhaps this site caters to an overly literate part of the diabetes-suffering population? I guess those who are technically inclined tend to be more educated.


Good read but I would point out the whole green/red button thing is completely dependent on your site design. I've run dozens of split tests on dozens of sites and there is no right answer. In fact sometimes increasing contrast on conversion points lowers conversion rate. The change in message is the big takeaway from this article.


Related reading on button color test: Red beats Green - http://blog.performable.com/post/631526233/button-color-test...


Dmix, I must congratulate you on your success! Plus your site design is very professional. Great job.

I noticed on your homepage you are still using 'Sign up for free' (at the bottom). Any specific reason for that?


About 75% of our signups come from the homepage CTA so I only bothered to test that one. Also I wasn't sure if A/Bingo let you run the same test in multiple places.

I'm working on replacing the green footer calls to action with the "Get started now" version today.


Congratulations! I like empirical inputs like these. I also read from one of Kissmetrics' articles that using the phrase "It's Free" also increases conversion rate.


I've become suspicious about the word "free" on any website. It almost has the opposite effect on me as is intended. I wonder if this is true for other people?


You aren't the only one. I can guess that in a few years the 'free' effect will decrease (but not reverse since there are always new users showing up).


Thanks for the info.

Question though, how do you get these measurements? Out of X people who visit the website who don't log on, how many will sign up vs not?


http://www.bingocardcreator.com/abingo , apparently.

The brief version is that you cookie each visitor with a random unique identifier. Each identifier maps to one of the versions under testing in a durable fashion. The first time you see an identifier for a particular test, you increment the participant count for that appropriate version. When a conversion -- here a signup -- happens, you look at the identifier, check what version they saw, and increment the conversion counter. From then it is just a math problem. So when you see a conversion rate like 24%, that implies that 24% of people who viewed a page the test was active for signed up prior to becoming lost to the system. (By, for example, leaving forever, clearing cookies, etc.)
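A rough sketch of that bookkeeping in plain Python (an in-memory toy, not A/Bingo's actual implementation; the version names and helpers are made up):

  import hashlib, uuid
  from collections import defaultdict

  VERSIONS = ["sign_up_for_free", "get_started_now"]   # made-up variant names

  participants = defaultdict(int)   # version -> distinct visitors who saw the test
  conversions = defaultdict(int)    # version -> signups
  seen = set()                      # identifiers already counted as participants

  def assign_version(identifier):
      # Durable mapping: the same cookie identifier always maps to the same version.
      digest = int(hashlib.md5(identifier.encode()).hexdigest(), 16)
      return VERSIONS[digest % len(VERSIONS)]

  def on_page_view(identifier):
      version = assign_version(identifier)
      if identifier not in seen:            # first sighting: count the participant
          seen.add(identifier)
          participants[version] += 1
      return version                        # render this version of the page

  def on_signup(identifier):
      conversions[assign_version(identifier)] += 1

  # Example: a new visitor gets a random cookie, views the page, then signs up.
  cookie = uuid.uuid4().hex
  on_page_view(cookie)
  on_signup(cookie)
  print(dict(participants), dict(conversions))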

Although one could theoretically exclude folks who log in later from the participant count, I think that is a poor use of your time for most people. Existing site users will be split across all alternatives evenly, so their failure to sign up for the site affects all alternatives equally. Since A/B testing doesn't really care about the exact value of the conversion rates and focuses on the differences between them, that comes out in the wash. (Plus, for many services, first time visitors swamp existing users of the service, so even if you were worried about distortion it would be minimal.)


The conversion rate I posted at the top of the article is for new visitors to my site (not including returning).

The metric we used in the A/B tests was signups, and the participants are anyone who lands on the homepage (new or returning).



