

A/B Testing is not Snake Oil. - sushi
http://visualwebsiteoptimizer.com/split-testing-blog/ab-testing-is-not-snake-oil/

======
elbrodeur
A/B testing is just one component of a good product development cycle. The
easiest pitfall, though, is that the data only tells you what -- not why. This
can lead you to make data-based design decisions blindly.

I think the ideal design and development cycle incorporates A/B testing at the
end to spot any outliers: You should already have done in-person usability and
user testing. It's extremely cheap and, in some cases, almost magical. Seeing
a real person use your product will guarantee you plenty of "a-ha!" moments.

After testing with real users, I think it's appropriate to test with
statistical users: You've already tackled the most glaring issues with what
you're building, and the knowledge of how real people use your website can
often help show you the "why" behind the data, rather than just the "what".

Steve Krug's "Don't Make Me Think" has a chapter near the end on user testing
on a shoestring budget. I'd highly recommend it.

------
Jabbles
_If you run 100 different A/B tests and only 1 of them produces good results,
you are only going to publish about that one good result and not the 99 other
unsuccessful results._

You'd better be sure those "1 in 100" results had a confidence level well
above 99%. I think misunderstandings of statistical analysis play a large part
in the mistrust some people (mistakenly) have of A/B testing. If you get your
analysis wrong, you'll slowly realise that the promised gains of the test turn
out to be illusory, hence the comparison to snake oil.

~~~
paraschopra
We only publish results that are statistically significant (at 95%+
confidence).
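
For concreteness, here is a minimal sketch of that kind of significance check,
assuming a standard two-sided two-proportion z-test (the test choice and the
numbers are illustrative assumptions, not necessarily the exact method used):

    # Minimal sketch: two-sided two-proportion z-test for an A/B test.
    # The test choice and the numbers below are illustrative assumptions.
    from math import sqrt, erf

    def significance(conv_a, n_a, conv_b, n_b):
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
        return z, p_value

    z, p = significance(conv_a=200, n_a=10000, conv_b=250, n_b=10000)
    print("z = %.2f, p = %.3f" % (z, p))  # "95%+ confidence" means p < 0.05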

~~~
Jabbles
But... that's my point. If you run 20 completely random tests, the probability
that at least one of them hits a "95% confidence level" is ~64%. Statistical
significance has to be adjusted depending on the number of times you run your
test, or it loses all its value.
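
A quick back-of-the-envelope check of that figure, assuming the tests are
independent and none of them has a real effect:

    # If k independent tests each run at a 5% false-positive rate, the chance
    # that at least one looks "significant at 95%" by luck alone is 1 - 0.95^k.
    for k in (1, 20, 100):
        print(k, "tests ->", round(1 - 0.95 ** k, 2))
    # 1 -> 0.05, 20 -> 0.64, 100 -> 0.99
    # A Bonferroni-style correction tests each of the k comparisons at
    # 0.05 / k to keep the overall false-positive rate near 5%.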

I've already posted this link today, but I really recommend this guide:
<http://www.evanmiller.org/how-not-to-run-an-ab-test.html>

~~~
paraschopra
It is 100 different tests, not the same test run 100 times.

~~~
mattmanser
I sometimes worry that, as someone running an A/B testing startup, you show a
shocking lack of understanding of statistics.

If you run 100 tests, a small number of them will be FALSE positives: their
confidence level will be above 95% even though they are, in fact, just
statistical anomalies.

That's why it's 95% confidence and not 100% confidence.

What this means is that, by hiding all your failed tests, you are probably
only writing about FALSE positives.
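
A small simulation makes the point concrete (the 2% conversion rate, sample
size, and threshold below are assumptions, chosen only to show the mechanism):

    # Simulate 100 A/B tests where A and B are actually identical (both
    # convert at 2%). Roughly 5 of them should still clear a 95% significance
    # threshold, and every one of those is a FALSE positive.
    import random
    from math import sqrt

    random.seed(0)

    def looks_significant(conv_a, conv_b, n):
        p_a, p_b = conv_a / n, conv_b / n
        pooled = (conv_a + conv_b) / (2 * n)
        se = sqrt(pooled * (1 - pooled) * 2 / n)
        return se > 0 and abs(p_b - p_a) / se > 1.96  # two-sided 95% level

    n = 5000
    hits = 0
    for _ in range(100):
        a = sum(random.random() < 0.02 for _ in range(n))
        b = sum(random.random() < 0.02 for _ in range(n))
        hits += looks_significant(a, b, n)
    print(hits, "of 100 no-effect tests came out 'significant'")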

If you can't understand that, how can we trust any of your blog posts?

NB/update: Even though I'm generally pretty good at maths, I've always found
statistics extremely hard. I totally understand how hard it is because it can
produce such mind-bogglingly counter-intuitive results. But if you don't
understand something that's key to your domain, you should learn about it, not
gloss over it.

Regardless of the statistics though, I personally think
<http://visualwebsiteoptimizer.com> is very good.

However, I have seen people here on HN make posts about A/B testing that, even
with my shallow understanding of statistics, make me suspect they really don't
get it.

~~~
reeses
I had a basic pre-college understanding of probability and statistics when I
started as the senior architect at a company producing multi-variate testing
software.[1] When you dip your toe into that pool, Taguchi is the first thing
you read about, so the team implemented that.

It became apparent that Taguchi wasn't really appropriate or sufficient for
web-based testing, so the team learned, devised, and implemented more
appropriate MVT models.[2]

One notable bug that we discovered involved a self-optimizing test. The idea
was that, once we reached a certain confidence level, we would slowly grow the
number of targets that were fed the most successful variant.

We had a minor code error (on the order of an off-by-one, or switching a < and
a <=) that grew the successful variant too quickly, at a point where the
confidence level was effectively non-actionable.

As I recall, it took us about six months to notice, and none of our clients
noticed.
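
To make the shape of that bug concrete, here is a hypothetical sketch of such
a ramp (the names and thresholds are invented for illustration; this is not
the actual code):

    # Hypothetical self-optimizing ramp: once the leading variant clears a
    # confidence threshold on enough traffic, its share of visitors grows
    # step by step. All names and values here are illustrative.
    MIN_SAMPLES = 1000          # don't act on "confidence" before this
    CONFIDENCE_NEEDED = 0.95
    RAMP_STEP = 0.05            # grow the winner's share 5 points per check
    MAX_SHARE = 0.90

    def next_share(current_share, samples, confidence):
        # The bug class described above lives in guards like this one: an
        # off-by-one or a swapped < / <= lets the ramp start while the
        # confidence number is still effectively noise.
        if samples < MIN_SAMPLES or confidence < CONFIDENCE_NEEDED:
            return current_share
        return min(current_share + RAMP_STEP, MAX_SHARE)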

MVT, and especially our implementation, is obviously much more complicated
than straightforward A/B testing. The fact that no one was able to sniff out
such an obvious error when their tests didn't improve conversion as much as
expected has left me with the idea that, while testing is not snake oil, I
have 99% confidence the population involved in split testing has only a
superficial idea of what they're doing.

[1] I had previously implemented a very simple Apache plugin, mod_gating, that
I should clean up and throw on github. Most of the work was in the lexer for
the configuration file. :-)

[2] Much of the design of appropriate statistical models was done by
consulting with statistics departments at a couple of local top-ten
universities. We figured advanced stats is like cryptography: if you're not an
expert in the general field and you come up with a "proprietary" solution,
you're probably screwing something up.

------
chrisaycock
In the spirit of A/B testing, perhaps you could point to some reasonable and
well-planned tests that produced _no_ actionable results. Contrast that
against a test that did produce actionable results. Then compare what went
wrong and what, if any, lessons can be extrapolated. That would make for an
interesting---and possibly meta---blog post.

~~~
paraschopra
Yep, that's a great idea. In fact, I had blogged about one in the past. Here
it is: <http://visualwebsiteoptimizer.com/split-testing-blog/left-vs-right-sidebar-which-layout-works-best/>

