
An A/B Testing Story - forgingahead
https://training.kalzumeus.com/newsletters/archive/ab-testing-story
======
aresant
"If you work at a software company. . . you should be doing A/B testing for
the same reason you have an accountant do your taxes. It is a low-risk low-
investment high-certainty way to increase economic returns."

Spot on, and you should 100% support Patrick & Nick and buy their books /
videos / etc.

But A/B testing is not as simple as hiring an accountant. Running A/B testing
as part of your corporate culture requires:

\- Front end developer to install / manage testing engine.

\- UX designer to measure / understand where holes may be in your flow.

\- Competent web designer to execute instructions from UX designer.

\- Statistically sound processes to know when to "call" a test (no, Optimizely
does NOT do this)

\- SEO / marketing people to make sure that your A/B test didn't just break
marketing flow

\- etc

Yes, anybody could go run a headline test.

But running A/B testing with any sort of regularity, scale, and success is a
complicated problem.

~~~
btilly
Meh, you make it sound too hard. Here is a perfectly adequate A/B testing
regime.

Assign visitors into buckets randomly. Track conversion counts only. Declare
the winner of an A/B test whenever one gets 100 conversions ahead, or
whichever is ahead after 10,000 conversions. Do this with every test. As long
as there are no obvious interaction effects, you can run multiple tests in
parallel.

If you wish, you can replace 10,000 with N where N is however many conversions
you expect to get in a month. Replace 100 by the square root of N. Real
differences below 2/sqrt(N) are basically coin flips. Even after many tests,
you are very unlikely to have ever made an error as big as 5/sqrt(N).
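
A minimal sketch of that rule (my own illustration; the bucketing helper and
function names are just one way to wire it up):

```python
import hashlib

def assign_bucket(visitor_id, test_name):
    # Stable 50/50 split: the same visitor always lands in the same bucket.
    digest = hashlib.md5(f"{test_name}:{visitor_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def call_test(conv_a, conv_b, lead=100, cap=10_000):
    """Declare a winner once one variant is `lead` conversions ahead,
    or take whichever is ahead after `cap` total conversions
    (a tie at the cap is a coin flip anyway)."""
    if abs(conv_a - conv_b) >= lead:
        return "A" if conv_a > conv_b else "B"
    if conv_a + conv_b >= cap:
        return "A" if conv_a >= conv_b else "B"
    return None  # not enough data yet; keep the test running
```

Swap in lead=sqrt(N) and cap=N if your monthly conversion volume is different.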

If you want to do something more sophisticated than this, you need competent
people. But this is worlds better than what most companies do. (And yes, it is
better than what Optimizely does for you by default.)

~~~
in_cahoots
I appreciate what you're trying to say here, but you've basically just
described the prototypical way _not_ to run an A/B test. Ending a test as soon
as you reach significance ensures a high rate of false positives, and doesn't
clearly tell you when to end the experiment if there is no effect.

See also: [http://www.evanmiller.org/how-not-to-run-an-ab-
test.html](http://www.evanmiller.org/how-not-to-run-an-ab-test.html)
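
You can see the problem with a quick A/A simulation (my own sketch; the peeking
interval and trial counts are arbitrary): there is no real difference between
the buckets, but stopping at the first "significant" naive z-test still
declares a winner far more often than 5% of the time.

```python
import math, random

def peeking_false_positive_rate(n_max=10_000, p=0.05, check_every=500, trials=500):
    """A/A test: both buckets convert at the same rate p, but we run a naive
    two-proportion z-test every `check_every` visitors and stop at the first
    'significant' result."""
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        for n in range(1, n_max + 1):
            conv_a += random.random() < p
            conv_b += random.random() < p
            if n % check_every == 0:
                pooled = (conv_a + conv_b) / (2 * n)
                se = math.sqrt(2 * pooled * (1 - pooled) / n)
                if se > 0 and abs(conv_a / n - conv_b / n) / se > 1.96:
                    false_positives += 1   # "winner" found where none exists
                    break
    return false_positives / trials

print(peeking_false_positive_rate())   # typically well above the nominal 5%
```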

~~~
btilly
You have just pattern-matched to a response without understanding. And you're
wrong.

See [http://www.evanmiller.org/sequential-ab-
testing.html](http://www.evanmiller.org/sequential-ab-testing.html) for a very
similar methodology presented by the very person you are citing. The
difference is that I am aiming to always produce an answer from the test, even
if it is a statistical coin flip, while he aims to only produce an answer
roughly 5% of the time if there is no difference.

~~~
in_cahoots
Maybe I am not understanding your post, but aren't you just declaring a winner
after N trials, even if that is not significant? That seems to be a critical
distinction here.

~~~
btilly
Right. We are trying to make a business decision, not conduct science.

If you go with whichever is ahead, you'll still tend to pick the better variant
even for differences too small to measure reliably. Yes, there will be
mistakes, but the mistakes are guaranteed to be pretty small.

If you insist on a higher standard, you'll learn which choices you can be
confident of... but you'll still need to figure out what to do when the
statistics didn't give you a final answer.

I think that the first choice is more useful. Evan prefers to clearly
distinguish your coin flips from solid results. But in the end you have to
make a decision, and it isn't material what decision you make if the
conversion rates are close.

~~~
in_cahoots
I'm sorry, but I wouldn't call that framework A/B testing so much as running
two experiments and seeing who's ahead after some arbitrary period of time.

~~~
rawnlq
I think this post summarizes it really well:
[http://bjk5.com/post/12829339471/ab-testing-still-works-sarcastic-phew](http://bjk5.com/post/12829339471/ab-testing-still-works-sarcastic-phew)

If you have any sort of decent dashboarding, the cost of a wrong decision is
really not all that bad compared to the cost of being a purist.

~~~
btilly
Oh boy. Let's replace statistics with guesses off of graphs!

That visible "zone of stupidity" is based on how long it takes to make a
conversion, which has everything to do with how your product works and nothing
to do with how statistics works. There is absolutely no difference between the
shape of a graph that leads to an accidental wrong decision and one that
reflects a real difference - the patterns you think you see don't mean what
your brain decides they mean.

And more importantly, if you stop in 1/4 of the time that you would have been
willing to run the test, you have a quarter of the sample and therefore twice
the noise, so the potential loss when you are wrong is roughly twice the worst
error you could have made by running the test to completion.

Have you ever been in an organization that rolled out an innocuous looking
change that killed 15% of the business? I have. Over the 10 months it took to
find the offending subject line, there was, shall we say, "significant
turnover in the executive team".

Math exists for a reason. Either learn it, or believe people who have learned
it. Don't substitute pretty pictures that you don't understand, then call it
an explanation.

~~~
rawnlq
I admit I don't know the math well so I am curious to know how to fix my
intuition:

Let's say you want to figure out the unknown bias on two coins. You flip both
continuously and plot the percentage of heads you see. Due to the law of large
numbers, these percentages will eventually converge to the true probabilities
(which is how I am interpreting the graphs in that blog post).

The bad case is when the two curves have actually "flatlined" in the wrong
order, so as a pattern-matching human you mistakenly believe the rates have
converged prematurely. I don't know how to work out the math on this, but let's
say a "flatline" is visually 100 points or so with no significant slope. Then
this should be pretty rare, right?
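
I guess a quick simulation is one honest way to check my intuition (the bias
gap, window, and flatness threshold below are just guesses on my part):

```python
import random

def flatline_check(p_a=0.052, p_b=0.050, flips=2000, window=100, trials=2000):
    """How often does the worse coin (B) look like the winner after `flips`
    flips, and how often do both cumulative-rate curves also look 'flat'
    (moved by less than one point) over the last `window` flips?"""
    wrong_order = wrong_and_flat = 0
    for _ in range(trials):
        heads_a = heads_b = 0
        for n in range(1, flips + 1):
            heads_a += random.random() < p_a
            heads_b += random.random() < p_b
            if n == flips - window:
                early_a, early_b = heads_a / n, heads_b / n
        rate_a, rate_b = heads_a / flips, heads_b / flips
        if rate_b > rate_a:                       # worse coin looks better
            wrong_order += 1
            if abs(rate_a - early_a) < 0.01 and abs(rate_b - early_b) < 0.01:
                wrong_and_flat += 1               # ...and both curves look flat
    return wrong_order / trials, wrong_and_flat / trials

print(flatline_check())
```

If the true gap is small compared to the noise at this sample size, the curves
end up "flatlined" in the wrong order in a large fraction of runs, not rarely.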

~~~
btilly
Don't try to do this by visual pattern recognition. Do math. There are plenty
of statistical tests you can use; use them. Any of them is better than looking
at a graph and guessing from the shape.

If you want to understand what is going on, learn the Central Limit Theorem.
That will tell you how fast the convergence to the law of large numbers is.
(There are two such laws, the strong and the weak.)
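
For a rough sense of that convergence speed: the CLT says the observed rate
after n samples is approximately normal around the true rate p with standard
deviation sqrt(p(1-p)/n). A tiny illustration (the rate and n are made-up
numbers, not from the blog post):

```python
import math

p, n = 0.05, 10_000                     # illustrative true rate and sample size
se = math.sqrt(p * (1 - p) / n)         # CLT standard error of the observed rate
print(f"observed rate = {p} +/- {1.96 * se:.4f} about 95% of the time")
# roughly 0.05 +/- 0.0043, so 5.0% vs 5.3% is not distinguishable at this n
```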

------
birken
I don't doubt the value proposition that is here, and I don't doubt that the
authors of this blog post will make a great video and book about A/B testing
that is helpful.

This is fairly well-trodden ground though, and for those who want a cheaper
approach, here are some free resources that I like. Some of these are my own
blog posts, but I'm not selling anything, so pardon the self-promotion:

1\. "The Pitfalls Of Bias In A/B Testing" \-
[http://danbirken.com/ab/testing/2014/04/08/pitfalls-of-
bias-...](http://danbirken.com/ab/testing/2014/04/08/pitfalls-of-bias-in-ab-
testing.html)

2\. "If You Aren't Doing Basic Conversion Optimization, You Probably Should
Be" \- [http://danbirken.com/startups/2015/05/18/conversion-
optimiza...](http://danbirken.com/startups/2015/05/18/conversion-
optimization.html)

3\. "How not to run an A/B test" \- [http://www.evanmiller.org/how-not-to-run-
an-ab-test.html](http://www.evanmiller.org/how-not-to-run-an-ab-test.html)

4\. "Evan Miller's sample size calculator" \- [http://www.evanmiller.org/ab-
testing/sample-size.html](http://www.evanmiller.org/ab-testing/sample-
size.html)

5\. "ABBA Open Source A/B test calculator" \-
[https://www.thumbtack.com/labs/abba/](https://www.thumbtack.com/labs/abba/)

(and a bonus one because the improved sign up flow in the OP still had a
password prompt)

6\. "Password prompts are annoying" \-
[http://danbirken.com/usability/2014/02/12/password-
prompts-a...](http://danbirken.com/usability/2014/02/12/password-prompts-are-
annoying.html)
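
For a rough sense of what sample size calculators like #4 and #5 are doing,
here is a back-of-the-envelope sketch of the standard two-proportion sample
size formula (per-variant n at 5% significance and 80% power; the baseline
rate and lift are made-up examples):

```python
import math
from statistics import NormalDist

def sample_size(p1, p2, alpha=0.05, power=0.80):
    # Standard two-proportion sample size approximation (per variant).
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p1 - p2) ** 2

# e.g. detecting a lift from a 5% to a 6% conversion rate:
print(round(sample_size(0.05, 0.06)))  # about 8,200 visitors per variant
```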

~~~
gingerlime
Thanks for sharing. It's nice to see most links greyed-out as visited, but
some aren't (yet)!

Shameless plug: Two open source libraries for A/B testing I created and use
regularly.

AlephBet[0] is the JS library to write your test in (with multiple backend
alternatives)

Gimel[1] is an AWS Lambda backend you can run for near zero cost. Even at
scale.

The Gimel (minimalistic) dashboard uses algorithms / code from Evan Miller and
others to do the Bayesian statistical analysis.

[0]
[https://github.com/Alephbet/alephbet](https://github.com/Alephbet/alephbet)

[1] [https://github.com/Alephbet/gimel](https://github.com/Alephbet/gimel)

------
nickdpi
Hi, I'm the one who's putting together The A/B Testing Manual with Patrick.
Gonna answer a few questions (thanks for asking 'em!):

1\. On the difficulty of A/B testing: It is _heavily_ dependent on a lot of
different factors in your organization. Developers' ability to quickly
generate variant pages is a huge issue: lots of places simply can't make
changes to revenue-generating pages. Lots of places deal with politics that
keep them from rationally analyzing A/B tests. And c-level support can ram
through a lot, of course.

I know that we're all from planet Vulcan and are very hyper-logical about
testing, but most orgs don't work that way. 98% of my work involves doing
therapy on people who are used to making design decisions by internal debate.
Shifting that culture is tremendously difficult in large organizations where
lots of smart people have lots of strong opinions.

Put another way: yes, it's easy to run an A/B test. It is hard to ship the
revenue-generating design decisions that come from A/B testing.

2\. On where to start: I try to research my test ideas so I know that I'm
testing the right things, in the right places. Messaging tends to have much
higher impact than specific design elements, so you want to make sure you're
communicating to people effectively.

With total blank-slate clients, I get your GA install in order (nobody ever
has this) and run heat & scroll maps on all the key pages in your funnel. I
also email a survey to your existing customers to see if there are any
surprising patterns. And with my bigger clients, I even run usability tests
(on usertesting.com) and customer interviews (recruiting through ethn.io) to
understand their motivations for purchasing.

None of this has anything to do with actual A/B tests – but it has everything
to do with making sure you're not stabbing in the dark when you do test.

Finally, agreed with @gk1 on big, drastic changes. Harder to implement (see
point about dev time above) but much more likely to bear fruit – and force the
org to really question how they're speaking to customers.

3\. It's a new article. The A/B Testing Manual was launched yesterday at
abtestingmanual.com. Videos will be ready in a month or two, tops. Launch
discount expires this Friday.

Great questions. Keep 'em coming!

------
awesomebob
A question for Patrick, or for anyone else with experience implementing A/B
testing from the early stages of a service/product:

When and where do you start? If you start too early, you don't have enough
traffic for results to be statistically significant.

In the article, Patrick said that he started with the trial signup form in his
3rd year.

Most of the advice on A/B testing that I've found is understandably aimed at
people with existing business/products that can massively benefit from it.
Does anyone have any more material about how to get started from the early
stages?

~~~
gk1
I've shared some thoughts on this previously: [http://www.gkogan.co/blog/test-
big-changes/](http://www.gkogan.co/blog/test-big-changes/)

TL;DR - Test big, drastic changes instead of fiddling with button colors and
headlines. Examples of drastic changes:

\- Entirely different homepage with different messaging.

\- Change "Features" page to "Benefits" page and change its content
accordingly.

\- If you're offering a free download or free product, test asking for an
email first.

\- If your SaaS has a long signup form, test allowing people to jump right in
with just an email address.

\- If you're sending a robotic "welcome" email, test sending a very short
personalized email instead.

~~~
ssharp
This has been my mantra as well. Test big and you'll likely fail or win a lot
faster.

I'd also suggest making sure you approach testing by starting with a
hypothesis and then testing that hypothesis. By voicing or writing down your
hypothesis, you will be more likely to avoid the "stupid test trap", a phrase
I use to describe running tests like changing button colors -- what hypothesis
is changing button colors trying to prove? That people like red more, "for
reasons"?

As mentioned above, each A/B test you run has a serious opportunity cost. I'm
fortunate enough to have thousands of people going through my sales funnel
daily but, even then, a test will still take several weeks or months to
validate and anything approaching a change of 5% or less will really eat away
at your ability to test more meaningful things.

------
sbierwagen
The "split into two steps" screenshots use the same thumbnail for both images.

~~~
matart
As I sit here trying to figure out the difference

~~~
awesomebob
Maybe there was a split test to see how many people noticed. :)

~~~
nickdpi
We shot a second trailer video on The A/B Testing Manual’s site that was the
exact same pitch, only with my dog (64lb border collie) in my lap.

Seriously contemplated A/B testing them as a joke, but erred on the side of
sanity.

~~~
awesomebob
I wish you had!

------
kareemm
Question for Nick/Patrick (or anybody else w/ experience): what's the accepted
2016 way to do A/B testing in Rails?

~~~
gk1
Have you looked into Optimizely paired with a gem like this:
[https://github.com/MartijnSch/optimizely-
gem](https://github.com/MartijnSch/optimizely-gem)?

------
coldcode
Or you can do like we do: put an A/B test on an about page. Then set it up
wrong so it doesn't even function. Finally fix it and determine nothing useful,
since there is no measurable item to compare. Watch programmer eyes roll...

~~~
gk1
Let me guess, your company hired a "growth hacker?"

------
MarkMc
Is this a newly published article? Or am I likely to have seen it on HN
before?

~~~
bbctol
Perhaps you saw a slightly different version of the article, which didn't get
as much click-through and so was replaced with this one.

~~~
jaggederest
Now I want to hook up a Markov chain generator to some sort of multi-armed
bandit setup and test each word individually.

