
Why You Don't A/B Test, and How You Can Start This August - dangrossman
https://training.kalzumeus.com/newsletters/archive/why_you_do_not_ab_test
======
noelwelsh
I hope no-one minds me touting a free course I'm offering on bandit
algorithms:

[http://noelwelsh.com/data/2013/08/05/why-you-should-know-about-bandit-algorithms/](http://noelwelsh.com/data/2013/08/05/why-you-should-know-about-bandit-algorithms/)

I was literally working on the post when Patrick's email arrived. It's more
focused on the algorithms and will, I think, be a good complement to what
Patrick is offering.

~~~
patio11
I upvoted you from zero, in the interest of HN remaining a supportive and
congenial place for people who are doing stuff to move the industry's state of
play forward.

~~~
noelwelsh
Thanks! I appreciate your support, especially given you don't agree with me on
the benefits of bandit testing.

------
Silhouette
I've now started flagging these kinds of submissions on HN, for the simple
reason that they are little more than ads for something vaguely specified that
may or may not even exist yet. There is absolutely no directly actionable
content in this article that I can see, nor any other original ideas that
anyone following HN doesn't see several times a week on the front page alone.

I post this only so I can note that this is not intended as a criticism of
Patrick, just a view that the hero worship and endless reposting of A/B
evangelism has far overstayed its welcome IMHO. If people want to read the
fluff as well, then as we get told every five minutes they can sign up for
Patrick's newsletter. Please, for the love of all that is holy, spare those of
us who have chosen _not_ to do so the endless join-my-mailing-list spam that
is starting to dominate HN, and save the HN submissions for substantial
content.

~~~
davidw
Flagging is for stuff that's inappropriate for this site, like political
articles, the latest injustice, news stories, conspiracy theories and so on.

For stuff like this that you may not happen to like, just don't vote for it,
and ignore it. I tend not to vote up patio11's stuff these days even though
it's good, just because it gets so much attention.

But it is 100% on topic for this site and does not deserve to be flagged.

How much time has patio11 spent giving free advice and information here? So
what if he also sells some stuff? That's what businesses do, sheez!

~~~
Silhouette
_Flagging is for stuff that's inappropriate for this site_

I don't think almost completely content-free promotional material _is_
appropriate for this site, whoever writes it and whatever it is advertising.

I don't flag many submissions, and following the FAQ, I don't normally comment
at all when I do. However, I don't know what actually happens when something
gets flagged, so in this case I made an exception because I'm aware that
Patrick himself is wary of submitting pieces like this. I didn't want anything
to imply a personal attack, nor general criticism of links about A/B testing
if they do offer new/interesting material.

------
GhotiFish
One thing that unnerves me about A/B testing is that you can't test for
evilness. How much of this glossary[1] do you think is intentionally designed?
I doubt that all of these changes were intentional, it's just that those
changes showed objective results.

A/B testing can't test for morality, and you may very well be implying
something with your B design that you didn't mean to imply, but which
nonetheless rates higher in your test.

In a business setting, it becomes awfully hard to argue morality against
objective numbers. It's hard to do that anyway. Many businesses operate with a
profit-first motive. So once a dark pattern is in, how are you going to get it
out?

Don't get me wrong, A/B testing is a tool, and like all good tools it can be
used for good or evil. It just worries me that A/B testing, despite good
intentions, can lead to evil results. I don't see anything on this page about
how to avoid evil results.

1\. [http://darkpatterns.org/](http://darkpatterns.org/)

~~~
wikwocket
I can see how you don't want to be in a situation where you are holding in
your hand evidence that Shady Marketing Tactic X is proven to convert better.
But I think the solution is simple: Don't A/B test shady features. If
something makes you uncomfortable or seems like a dark pattern, don't test it,
because you would never want it on your site anyway.

I don't think this is a failing of A/B testing. That someone can get numbers
that show "evil" tactics can make money is not materially different from
someone coding "evil" HTML in Notepad and getting results. As you said, it is
just a tool.

~~~
GhotiFish
The problem is you might not be aware you're making a shady feature. As an
example, suppose you are making an ad for your app, fooer. You A/B test some
banners on an ad distribution network. You find that a simpler ad with a
"download now" hyperlink graphic proves effective.

No big deal, right? People are going to be seeing these banners on places like
news sites, right? It's not tricking people. It's also pretty reasonable,
right? People are used to clicking on hyperlinks; it overcomes their
resistance to clicking on images. A reasonable move for a banner ad to make.

Except the ad distribution network serves content sites like Softpedia as
well.

You saw the spike because you emulated the download button right under the
mirror list, and people were confused. They downloaded your application
thinking they were getting something else.

Or what if I want a "Hey, you should sign up for notifications for other job
offerings in this field" message box? I'd love to have one of those in my
application. I A/B test and find that this message box drastically increases
signups.

Except it was because people thought it was a paywall.

The road to hell is paved with good intentions.

Consider this case about the Weebly blogging service:
[http://minimaxir.com/2013/05/overly-attached-startup/](http://minimaxir.com/2013/05/overly-attached-startup/)

They were A/B testing messaging for first-week signups. This hyper-aggressive
and insane route was actually considered. Do you think they're black hat? I
don't think so, but what if this had become standard practice?

------
rsync
I know exactly why we[1] don't A/B test.

It is because I don't want to serve the people that it would be effective on.

If you're marginally buying / not buying my product based on color scheme or
(buzz)word order or some other piece of puff, we're all better served if you
just move along.

[1] You know who we are.

~~~
baddox
Is your desire to serve only certain people ultimately based on profit motive
(i.e. you think fewer customers will be more valuable and stick around
longer), or is it purely a matter of principle, even if it means less profit
in the long run?

~~~
rsync
It is done out of principle.

------
qeorge
If you're looking for things to A/B test, please check out my post:

[https://news.ycombinator.com/item?id=6163397](https://news.ycombinator.com/item?id=6163397)

It details 19 A/B test ideas that have worked for us in the past. I hope HN
finds it useful!

------
noelwelsh
I'm interested to see the sample size calculations. Those numbers don't jibe
with any calculations I've ever done.

~~~
throwawayg99
Nor me. To detect a 1% change:

    
    
       > power.prop.test(p1=0.1, p2=0.11, power=0.8,sig.level=0.05)
         Two-sample comparison of proportions power calculation
    
                  n = 14750.79
                 p1 = 0.1
                 p2 = 0.11
          sig.level = 0.05
              power = 0.8
        alternative = two.sided
    
       NOTE: n is number in *each* group
    

Edit: so that's 30,000 samples. To do the same with a 5% absolute change in
the same area still takes roughly 1,400 samples (power=0.8, sig.level=0.05).
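
For reference, the 5% absolute figure comes from the same call; this is my own
run of the numbers, so treat the exact n as approximate:

    
    
       > power.prop.test(p1=0.1, p2=0.15, power=0.8, sig.level=0.05)
       # n comes out around 683 per group, i.e. roughly 1,400 samples total
    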

~~~
noelwelsh
I normally consider the relative minimum discernible effect. Your 5% absolute
change is a 50% increase in the base rate. I also typically go for a higher
power (e.g. 0.9). Under these conditions, 60K samples is more typical.

Sample size calculator here: [http://www.evanmiller.org/ab-testing/sample-size.html](http://www.evanmiller.org/ab-testing/sample-size.html)

You'll need to change the defaults to match the above to get the figures I
mention.

~~~
throwawayg99
You're right. That's a bit more reasonable, so back to my 10% change in base
rate (1% abs.) but with 90% power:

    
    
      > power.prop.test(p1=0.1, p2=0.11, power=0.9,sig.level=0.05)
    
         Two-sample comparison of proportions power calculation
    
                  n = 19746.62
                 p1 = 0.1
                 p2 = 0.11
          sig.level = 0.05
              power = 0.9
        alternative = two.sided
    
      NOTE: n is number in *each* group
    

Requires about 40,000 samples per test. I would strongly recommend anyone
serious about doing this look into MAB (multi-armed bandit) testing, as A/B
testing is way too expensive for testing at reasonable scale (unless you have
a strong a priori hypothesis to test).
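
If it helps, here's a minimal epsilon-greedy bandit sketch in R; the
conversion rates and traffic volume are invented, and a real deployment would
score live visitors rather than simulated ones:

    
    
      # Minimal epsilon-greedy sketch; true_rates is invented and would be
      # unknown in real life, where the reward is a live visitor converting.
      set.seed(42)
      true_rates <- c(A = 0.10, B = 0.11)
      epsilon    <- 0.1   # fraction of traffic spent exploring
      pulls   <- c(A = 0, B = 0)
      rewards <- c(A = 0, B = 0)
      for (i in 1:10000) {
        if (runif(1) < epsilon || any(pulls == 0)) {
          arm <- sample(names(pulls), 1)            # explore at random
        } else {
          arm <- names(which.max(rewards / pulls))  # exploit current best
        }
        pulls[arm]   <- pulls[arm] + 1
        rewards[arm] <- rewards[arm] + rbinom(1, 1, true_rates[[arm]])
      }
      rewards / pulls   # observed conversion rate per arm
      pulls             # most traffic should end up on the better arm
    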

------
patio11
If you have any questions, feel free to ask.

~~~
throwawayg99
Why are you suggesting A/B instead of multi armed bandit approaches?

~~~
ampersandy
A/B testing is a catch-all term for multi-variant experimentation. Multi-armed
bandit is a specific approach to testing[1], and even though most frameworks
provide A/B/N testing -- that is, not necessarily just two variants -- it is
easier to say 'A/B' instead of 'A/B/N'.
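
To make the A/B/N idea concrete, here's how you'd compare N variants at once
in R with a chi-square test of equal proportions (counts invented for
illustration):

    
    
      # Comparing three variants at once (invented counts):
      conversions <- c(A = 120, B = 145, C = 131)
      visitors    <- c(A = 1200, B = 1210, C = 1195)
      prop.test(conversions, visitors)  # chi-square test of equal proportions
    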

[1] [http://analytics.blogspot.ca/2013/01/multi-armed-bandit-experiments.html](http://analytics.blogspot.ca/2013/01/multi-armed-bandit-experiments.html)
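
As I understand it, the approach described in that post is Thompson sampling
(randomized probability matching). A bare-bones sketch in R, again with
made-up conversion rates:

    
    
      # Bare-bones Thompson sampling with Beta-Bernoulli posteriors; the
      # rates are invented, and rewards would come from live visitors.
      set.seed(42)
      true_rates <- c(A = 0.10, B = 0.11)
      succ <- c(A = 0, B = 0)
      fail <- c(A = 0, B = 0)
      for (i in 1:10000) {
        draws <- rbeta(2, 1 + succ, 1 + fail)  # one draw per arm's posterior
        arm   <- names(succ)[which.max(draws)]
        hit   <- rbinom(1, 1, true_rates[[arm]])
        succ[arm] <- succ[arm] + hit
        fail[arm] <- fail[arm] + (1 - hit)
      }
      (1 + succ) / (2 + succ + fail)  # posterior mean conversion rate per arm
    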

~~~
srom
Indeed, Google Analytics has added the multi-armed bandit approach in Content
Experiments. It's quite slick, btw, but definitely more difficult to implement
than traditional split testing. My 2 cents:

[1] The sample size calculator (How many subjects are needed for an A/B
test?): [http://www.evanmiller.org/ab-testing/sample-size.html](http://www.evanmiller.org/ab-testing/sample-size.html)

[2] <unashamed plug> easyAB: a jQuery plugin to easily start A/B testing using
Google Analytics.
[http://srom.github.io/easyAB/](http://srom.github.io/easyAB/) </plug>

------
shadeless
The bit about 37signals reminded me of their timelapse video, "evolution of a
homepage"[1], which shows just how many things they try before sticking to
something (albeit shortly).

[1] [http://vimeo.com/29088090](http://vimeo.com/29088090)

------
thenomad
All very good stuff, but this line stood out in particular:

 _" Video makes Waterfall software development look like bugs-in-your-teeth
speed, though"_

As someone who is coming to the close of a 4-year work stint on a _short film_
, I can tell you, you ain't wrong.

Indeed, it has gotten to the point that I've started a computer game design
project as light relief.

------
ronaldx
> It was a preventable accident.

Was it _really_ preventable? How should we know when to make a quick decision
and when to A/B test?

[Clarification: it seems to me that it's better to decisively get things done]

Considering that the most widely known A/B tests are based on 41 shades of
blue and pixel-perfection, I'm not sure that this is as obvious as the article
claims.

~~~
patio11
_the most widely known A/B tests are based on 41 shades of blue and
pixel-perfection_

This is not true among people who actually A/B test for a living. (The 41
shades of blue thing is a test cherry picked to suggest that testing is not
material. The only reason the world knows about that test, as opposed to
others conducted by Google/MSN, is because someone who believed they didn't
fit in a culture of testing called out that test as the _reductio ad absurdum_
of that culture.)

Without saying exactly what it was that the client didn't know at the time,
suffice it to say that if you got five A/B testing practitioners in a room and
asked them for the top five things to try on that client's site, every last
one would have listed the problematic area as something to test. I mean, it
wasn't the H1 on the front page, but it could have been.

This is similar to "How do we make our pages load faster?" Are there large
amounts of subjectivity and risk involved here? Yes. Trying to outguess your
favorite SQL query optimizer sometimes feels like reading chicken entrails.
But, if you're not using gzip yet, then you should turn on gzip, because _gzip
always wins._

~~~
ronaldx
I accept that you are an expert and that there is value in this expertise, but
I remain unconvinced that it's so obvious where to execute A/B tests (in
contrast to page load time, where bottlenecks can be measurably identified).

------
viggity
What the hell is S/J testing? Never heard of it, and apparently neither has
the googles.

~~~
dangrossman
Steve Jobs.

