An A/B Testing Story (kalzumeus.com)
166 points by forgingahead on Sept 20, 2016 | 46 comments



"If you work at a software company. . . you should be doing A/B testing for the same reason you have an accountant do your taxes. It is a low-risk low-investment high-certainty way to increase economic returns."

Spot on, and you should 100% support Patrick & Nick and buy their books / videos / etc.

But A/B testing is not as simple as hiring an accountant; running A/B testing as part of your corporate culture requires:

- Front end developer to install / manage testing engine.

- UX designer to measure / understand where holes may be in your flow.

- Competent web designer to execute instructions from UX designer.

- Statistically sound processes to know when to "call" a test (no, Optimizely does NOT do this)

- SEO / marketing people to make sure that your A/B test didn't just break marketing flow

- etc

Yes, anybody could go run a headline test.

But running A/B testing with any sort of regularity, scale, and success is a complicated problem.


Meh, you make it sound too hard. Here is a perfectly adequate A/B testing regime.

Assign visitors into buckets randomly. Track conversion counts only. Declare the winner of an A/B test whenever one gets 100 conversions ahead, or whichever is ahead after 10,000 conversions. Do this with every test. As long as there are no obvious interaction effects, you can run multiple tests in parallel.

If you wish, you can replace 10,000 with N where N is however many conversions you expect to get in a month. Replace 100 by the square root of N. Real differences below 2/sqrt(N) are basically coin flips. Even after many tests, you are very unlikely to have ever made an error as big as 5/sqrt(N).
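For concreteness, here is a minimal sketch of that rule in Python (illustrative only; the function names and the sticky-hash bucket assignment are just one arbitrary way to do it):

    import hashlib

    def assign_bucket(visitor_id: str) -> str:
        # Sticky pseudo-random assignment: the same visitor always lands in the same bucket.
        digest = hashlib.sha256(visitor_id.encode()).hexdigest()
        return "A" if int(digest, 16) % 2 == 0 else "B"

    def decide(conv_a: int, conv_b: int, n: int = 10_000):
        """Return 'A', 'B', or None (keep collecting data).

        Stop early when one variant is sqrt(n) conversions ahead; otherwise
        call it for whichever variant leads after n total conversions.
        """
        lead_threshold = round(n ** 0.5)   # 100 when n = 10,000
        lead = conv_a - conv_b
        if abs(lead) >= lead_threshold:
            return "A" if lead > 0 else "B"
        if conv_a + conv_b >= n:
            return "A" if lead >= 0 else "B"   # even if it's statistically a coin flip
        return None

    # e.g. decide(620, 480) -> "A" (140 ahead); decide(512, 500) -> None (keep going)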

If you want to do something more sophisticated than this, you need competent people. But this is worlds better than what most companies do. (And yes, it is better than what Optimizely does for you by default.)


I appreciate what you're trying to say here, but you've basically just described the prototypical way not to run an A/B test. Ending a test as soon as you reach significance ensures a high rate of false positives, and doesn't clearly tell you when to end the experiment if there is no effect.

See also: http://www.evanmiller.org/how-not-to-run-an-ab-test.html


You have just pattern-matched to a response without understanding. And you're wrong.

See http://www.evanmiller.org/sequential-ab-testing.html for a very similar methodology presented by the very person you are citing. The difference is that I am aiming to always produce an answer from the test, even if it is a statistical coin flip, while he aims to only produce an answer roughly 5% of the time if there is no difference.


Maybe I am not understanding your post, but aren't you just declaring a winner after N trials, even if that is not significant? That seems to be a critical distinction here.


Right. We are trying to make a business decision, not conduct science.

If you go with whichever is ahead, you'll reliably produce correct answers for wins that you can reliably measure. Yes, there will be mistakes, but the mistakes are guaranteed to be pretty small.

If you insist on a higher standard, you'll learn which choices you can be confident of, but you now need to figure out what to do when the statistics didn't give you a final answer.

I think that the first choice is more useful. Evan prefers to clearly distinguish your coin flips from solid results. But in the end you have to make a decision, and it isn't material what decision you make if the conversion rates are close.


I'm sorry, but I wouldn't call that framework A/B testing so much as running two experiments and seeing who's ahead after some arbitrary period of time.


Well, that is what A/B testing is, with a decision rule to try to make that determination in some sensible way.

Usually the decision rule is stated in terms of a statistical confidence test. Usually the confidence test is done poorly enough that it doesn't mean what people think it does.

And the stopping procedure isn't actually very arbitrary. You choose N based on the most data that you're willing to collect for this experiment. And stop when you're confident about which version will be ahead at that point.

So this procedure leaves you confident of having the best answer that you will get from the most extensive test that your organization is willing to commit to running. And the cost to the organization of running it is capped at sqrt(N) lost conversions.
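If you'd rather not take those claims on faith, a rough Monte Carlo of the rule (rates, lifts, and trial counts below are arbitrary) will show that relative differences well under 2/sqrt(N) get called close to 50/50, while larger differences get called quickly and almost always correctly:

    import random

    def run_rule_once(rate_a, rate_b, n, rng):
        # With a 50/50 traffic split and conversion counts only, each successive
        # conversion comes from A with probability rate_a / (rate_a + rate_b).
        p_a = rate_a / (rate_a + rate_b)
        threshold = round(n ** 0.5)
        conv_a = conv_b = 0
        while True:
            if rng.random() < p_a:
                conv_a += 1
            else:
                conv_b += 1
            lead = conv_a - conv_b
            if abs(lead) >= threshold or conv_a + conv_b >= n:
                return ("A" if lead >= 0 else "B"), conv_a + conv_b

    def summarize(rate_a, rate_b, n=10_000, trials=500, seed=0):
        rng = random.Random(seed)
        wrong = used = 0
        for _ in range(trials):
            winner, conversions = run_rule_once(rate_a, rate_b, n, rng)
            used += conversions
            wrong += winner == "A"   # B is the genuinely better variant below
        print(f"B lift {rate_b / rate_a - 1:+.1%}: wrong call {wrong / trials:.0%}, "
              f"avg conversions used {used / trials:,.0f}")

    # 2% baseline conversion rate; relative lifts below and above 2/sqrt(10,000) = 2%
    for lift in (0.005, 0.02, 0.05, 0.10):
        summarize(0.02, 0.02 * (1 + lift))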


I think this post summarizes it really well: http://bjk5.com/post/12829339471/ab-testing-still-works-sarc...

If you have any sort of decent dashboarding, the cost of a wrong decision is really not all that bad compared to the cost of being a purist.


Oh boy. Let's replace statistics with guesses off of graphs!

That visible "zone of stupidity" is based on how long it takes to make a conversion which has everything to do with how your product works, and nothing to do with how statistics works. There is absolutely no difference between the graph you expect that leads to an accidental wrong decision and one that detects a correct difference - the patterns that you think you see don't mean what your brain will decide they do.

And more importantly, if you stop in 1/4 of the time that you would have been willing to run the test, the potential loss when you are wrong is twice the worst error that you could make if you put more effort into it.

Have you ever been in an organization that rolled out an innocuous looking change that killed 15% of the business? I have. Over the 10 months it took to find the offending subject line, there was, shall we say, "significant turnover in the executive team".

Math exists for a reason. Either learn it, or believe people who have learned it. Don't substitute pretty pictures that you don't understand, then call it an explanation.


I admit I don't know the math well so I am curious to know how to fix my intuition:

Let's say you want to figure out the unknown bias on two coins. You flip both continuously and plot the percentage of heads you see. Due to the law of large numbers, these percentages will eventually converge to the true probabilities (which is how I am interpreting the graphs in that blog post).

The bad case is if the two coins are actually "flatlined" in the wrong order so as a pattern matching human you mistakenly believe the rates have converged prematurely. I don't know how to work out the math on this but let's say a "flatline" is visually 100 points or so with no significant slope. Then this should be pretty rare right?


Don't try to do this by visual pattern recognition. Do math. There are plenty of statistical tests that you can use, use them. Any of them is better than looking at a graph and guessing from the shape.

If you want to try to understand what is going on, learn the Central Limit Theorem. That will let you know how fast the convergence is to the laws of large numbers. (There are two, the strong and the weak.)
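To make the CLT point concrete, here is a rough sketch with made-up rates (a 10.0% vs. a 10.5% "coin") of how often the genuinely worse coin is still ahead after n flips of each. Eyeballing a few hundred points of a converging graph tells you almost nothing:

    from math import erf, sqrt

    def prob_worse_ahead(p=0.10, diff=0.005, n=1_000):
        """Normal (CLT) approximation to the chance that the genuinely worse coin
        shows the higher observed rate after n flips of each coin.

        The observed difference in rates has standard error ~ sqrt(2*p*(1-p)/n),
        so the worse coin is "ahead" whenever a roughly normal variable with
        mean diff and that standard error comes out negative.
        """
        se = sqrt(2 * p * (1 - p) / n)
        z = diff / se
        return 0.5 * (1 - erf(z / sqrt(2)))   # P(Normal(diff, se) < 0)

    for n in (100, 1_000, 10_000, 100_000):
        print(f"n = {n:>7,}: worse coin ahead about {prob_worse_ahead(n=n):.0%} of the time")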


I take it you're the person who wrote the original to which your linked article responds?


Yes.

It took a lot of work to get down to the super simple version. :-)


You're describing the math of A/B testing, but not the work of analyzing data to decide what experiment to do next, designing the experiment, building and integrating it, then assigning visitors to the buckets and so forth.


The technical end of building / integrating / assigning visitors and so on tends to be relatively simple if it is done by competent people who know how to not overcomplicate things.

Coming up with ideas to test tends to be fairly easy if you've got a competent product person.

Perhaps I'm biased. I've built multiple A/B testing systems for multiple companies, and know what to do. However it didn't seem hard for me the first time either.


This seems like a nice application of two related principles: 1) The harder it is to tell which choice is better, the less it matters which you choose, and 2) a decision has value even in the absence of proof that it's the very best decision because it frees up resources for the next decision, which may be more important.


Depending on the implementation of "assign visitors into buckets randomly", this should include a test to ensure the populations are correctly randomized. "Unsophisticated" users of these heuristics are prone to make terrible decisions if the treatment populations are not properly randomized (this can be due to systemic bias caused by e.g. caching, or treatment effects like survival bias).
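One cheap guardrail is a sample-ratio-mismatch check on the assignment counts themselves, before looking at conversions at all. Here's a sketch using scipy, assuming an intended 50/50 split; the alpha threshold is arbitrary:

    from scipy.stats import chisquare

    def check_sample_ratio(visitors_a, visitors_b, alpha=0.001):
        """Flag a likely randomization bug if the observed split deviates from
        the intended 50/50 split by far more than chance allows."""
        total = visitors_a + visitors_b
        _, p_value = chisquare([visitors_a, visitors_b], f_exp=[total / 2, total / 2])
        if p_value < alpha:
            # A tiny alpha is deliberate: with healthy randomization this should
            # almost never fire, so when it does, distrust the whole test
            # (caching, bots, redirect bugs), not just the conversion numbers.
            print(f"Sample ratio mismatch: {visitors_a} vs {visitors_b} (p = {p_value:.2g})")
        return p_value

    check_sample_ratio(50_421, 49_998)   # plausible noise, stays quiet
    check_sample_ratio(50_421, 48_700)   # fires: something upstream is biased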


Can you explain the statistics behind this methodology?


Essentially the same as http://www.evanmiller.org/sequential-ab-testing.html except that he aims to not produce an answer most of the time by chance, while I aim to always produce an answer even if it is statistically a coin flip. (Because, after all, you need to actually choose a headline in the end...)


Totally agree, a few bullet points you almost touched upon, too:

1. Management buy-in to always be testing and always be investing in testing

- It's one thing to have the boss wake up tomorrow and decide "We should do some A/B testing", have the team go off and implement a few tests, then when that's done, move on to the next feature-of-the-month. To really make it pay off, A/B testing needs to be done all the time, to the point where it drives priorities and it makes no sense to commit to the next thing until the tests confirm what that thing should be. Which brings me to:

2. A product design culture that integrates the results of A/B testing full circle back into product decisions

- Why even do these tests if results don't feed back into the product? If testing says header bar A's choices perform/monetize better than header bar B, but the designer sticks with header bar B simply because of his artistic "gut sense", then why are you even investing in testing?


I would add

1a. Management buy-in to devote resources to cleaning up completed/unsuccessful tests.

I witnessed first-hand the ball of mud resulting from 10 years of test upon test upon test. It's a huge impediment to quick iteration; simple changes are hard and hard changes are impossible. It's especially bad because test code tends to be poorly designed and documented due to the get-it-out-the-door mentality.


Another big aspect of the management buy-in piece is making sure they're happy with an unavoidable outcome of testing: there'll be a loser. It's quite difficult to get them comfortable with the notion that they've missed out on revenue/users as a result of a test, even if the winner will provide outsized gains.


Hiring an accountant is not really easy either, though it's made easier by the fact that accountants are fairly well established.

If accounting had been invented recently, it would be hard to find someone who knows what to do and it would be hard to get the rest of the company to operate in a way that allows him to do it.

The more companies do this, the easier it will get. The fewer people do this, the more of a competitive edge it is.


Even if you think Optimizely does a bad job of calculating test results, I'm assuming (I'm a VWO loyalist) it still gives you the raw numbers and it still provides a lot of other value. Just handling the basic logistics and measurement of the test is well worth the price.

You're mostly right about what types of resources you need; however, I'd add that the "competent web designer" should also be competent in jQuery. For a lot of tests, jQuery DOM manipulation is required or preferable to the visual editor. Those visual editors tend to fail when dealing with dynamic content and have the potential to halt your entire operation. Of course, you may also have teammates (or you yourself) who can do more than one of these things.


I don't doubt the value proposition that is here, and I don't doubt that the authors of this blog post will make a great video and book about A/B testing that is helpful.

This is fairly well-trodden ground, though, and for those who want a cheaper approach, here are some free resources that I like. Some of these are my own blog posts, but I'm not selling anything, so pardon the self-promotion:

1. "The Pitfalls Of Bias In A/B Testing" - http://danbirken.com/ab/testing/2014/04/08/pitfalls-of-bias-...

2. "If You Aren't Doing Basic Conversion Optimization, You Probably Should Be" - http://danbirken.com/startups/2015/05/18/conversion-optimiza...

3. "How not to run an A/B test" - http://www.evanmiller.org/how-not-to-run-an-ab-test.html

4. "Evan Miller's sample size calculator" - http://www.evanmiller.org/ab-testing/sample-size.html

5. "ABBA Open Source A/B test calculator" - https://www.thumbtack.com/labs/abba/

(and a bonus one because the improved sign up flow in the OP still had a password prompt)

6. "Password prompts are annoying" - http://danbirken.com/usability/2014/02/12/password-prompts-a...


Thanks for sharing. It's nice to see most links greyed-out as visited, but some aren't (yet)!

Shameless plug: Two open source libraries for A/B testing I created and use regularly.

AlephBet[0] is the JS library to write your test in (with multiple backend alternatives)

Gimel[1] is an AWS Lambda backend you can run for near zero cost. Even at scale.

The Gimel (minimalistic) dashboard uses the algorithms / code from Evan Miller and others to do the Bayesian statistical analysis.

[0] https://github.com/Alephbet/alephbet

[1] https://github.com/Alephbet/gimel


Hi, I'm the one who's putting together The A/B Testing Manual with Patrick. Gonna answer a few questions (thanks for asking 'em!):

1. On the difficulty of A/B testing: It is heavily dependent on a lot of different factors in your organization. Developers' ability to quickly generate variant pages is a huge issue: lots of places simply can't make changes to revenue-generating pages. Lots of places deal with politics that keep them from rationally analyzing A/B tests. And c-level support can ram through a lot, of course.

I know that we're all from planet Vulcan and are very hyper-logical about testing, but most orgs don't work that way. 98% of my work involves doing therapy on people who are used to making design decisions by internal debate. Shifting that culture is tremendously difficult in large organizations where lots of smart people have lots of strong opinions.

Put another way: yes, it's easy to run an A/B test. It is hard to ship the revenue-generating design decisions that come from A/B testing.

2. On where to start: I try to research my test ideas so I know that I'm testing the right things, in the right places. Messaging tends to have much higher impact than specific design elements, so you want to make sure you're communicating to people effectively.

With total blank-slate clients, I get your GA install in order (nobody ever has this), and run heat & scroll maps on all the key pages in your funnel. I also email a survey of your existing customers to see if there are any surprising patterns. And with my bigger clients, I even run usability tests (on usertesting.com) and customer interviews (recruiting through ethn.io) to understand their motivations for purchasing.

None of this has anything to do with actual A/B tests – but it has everything to do with making sure you're not stabbing in the dark when you do test.

Finally, agreed with @gk1 on big, drastic changes. Harder to implement (see point about dev time above) but much more likely to bear fruit – and force the org to really question how they're speaking to customers.

3. It's a new article. The A/B Testing Manual was launched yesterday at abtestingmanual.com. Videos will be ready in a month or two, tops. Launch discount expires this Friday.

Great questions. Keep 'em coming!


A question for Patrick, or for anyone else with experience implementing A/B testing from the early stages of a service/product:

When and where do you start? If you start too early, you don't have enough traffic to be statistically significant.

In the article, Patrick said that he started with the trial signup form in his 3rd year.

Most of the advice on A/B testing that I've found is understandably aimed at people with existing business/products that can massively benefit from it. Does anyone have any more material about how to get started from the early stages?


I've shared some thoughts on this previously: http://www.gkogan.co/blog/test-big-changes/

TL;DR - Test big, drastic changes instead of fiddling with button colors and headlines. Examples of drastic changes:

- Entirely different homepage with different messaging.

- Change "Features" page to "Benefits" page and change its content accordingly.

- If you're offering a free download or free product, test asking for an email first.

- If your SaaS has a long signup form, test allowing people to jump right in with just an email address.

- If you're sending a robotic "welcome" email, test sending a very short personalized email instead.


This has been my mantra as well. Test big and you'll likely fail or win a lot faster.

I'd also suggest making sure you approach testing by starting with a hypothesis and then testing that hypothesis. By voicing or writing down your hypothesis, you will be more likely to avoid the "stupid test trap", a phrase I use to describe running tests like changing button colors -- what hypothesis is changing button colors trying to prove? That people like red more, "for reasons"?

As mentioned above, each A/B test you run has a serious opportunity cost. I'm fortunate enough to have thousands of people going through my sales funnel daily but, even then, a test will still take several weeks or months to validate and anything approaching a change of 5% or less will really eat away at your ability to test more meaningful things.


See https://news.ycombinator.com/item?id=12541290 for one rule of thumb and methodology.

My experience is that most tests will come out under 5% differences. But most organizations will have some test that they could run with a greater than 15% win.

If your business scale does not allow you to run tests and detect the wins you hope for, you're better off following other people's best practices than you are trying to discover best practices from running your own tests.

Also note that most businesses have a "conversion funnel" where there are a number of steps from visitor to getting paid. If your business is big enough, you want to focus on getting paid. If your business is too small to get results that way, you should get started with just the first step in the funnel. That's what Patrick did with the trial signup form.
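For a rough number on "big enough to detect", the standard two-proportion sample-size approximation is easy to compute. This is a generic textbook formula with a made-up baseline rate, not a rule specific to this thread:

    from math import ceil
    from scipy.stats import norm

    def visitors_per_variant(base_rate, relative_lift, alpha=0.05, power=0.8):
        """Approximate visitors needed per variant to detect a relative lift in
        conversion rate, using the usual two-proportion normal approximation."""
        p1 = base_rate
        p2 = base_rate * (1 + relative_lift)
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

    # A 2% baseline: a 15% lift needs far less traffic than a 5% lift.
    for lift in (0.05, 0.15):
        print(f"{lift:.0%} relative lift: ~{visitors_per_variant(0.02, lift):,} visitors per variant")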


The "split into two steps" screenshots use the same thumbnail for both images.



As I sit here trying to figure out the difference


Maybe there was a split test to see how many people noticed. :)


We shot a second trailer video on The A/B Testing Manual’s site that was the exact same pitch, only with my dog (64lb border collie) in my lap.

Seriously contemplated A/B testing them as a joke, but erred on the side of sanity.


I wish you had!


Question for Nick/Patrick (or anybody else w/ experience): what's the accepted 2016 way to do A/B testing in Rails?


Have you looked into Optimizely paired with a gem like this: https://github.com/MartijnSch/optimizely-gem?


Or you can do like we do: put an A/B test on an about page. Then set it up wrong so it doesn't even function. Finally, fix it and determine nothing useful since there is no measurable item to compare. Watch programmer eyes roll...


Let me guess, your company hired a "growth hacker?"


Is this a newly published article? Or am I likely to have seen it on HN before?


Perhaps you saw a slightly different version of the article, which didn't get as much click-through and so was replaced with this one.


Now I want to hook up a markov chain generator to some sort of a multiarm bandit setup and test each word individually.


It's a new article. I received it via Patrick's email newsletter today.

Of course, Patrick is all over HN, so parts of the article may seem familiar.



