

Out of the cesspool and into the sewer: A/B testing trap - DeusExMachina
http://blog.asmartbear.com/local-minimum.html

======
patio11
This gets mentioned quite frequently: avoid local maxima, pivot pivot
pivot, etc. Mentioned less frequently is that doing so is really expensive.

I have eight A/B tests currently running. Here's how long it took to launch
each one. Spot the big bang test: ten minutes, fifteen minutes, fifteen
minutes, thirty seconds, fifteen seconds, an hour, three hours, three weeks.

Small A/B tests also have small risks associated with them. One of those eight
touches a single word on my website. It is almost inconceivable that that one-
word change could result in a POed customer sending me email. On the other
hand, the big bang test can cause (and has caused) customer support issues for
me, despite taking a great deal of time to minimize the impact.

In addition, big bang tests are often less conclusive than you want them to
be. My current big bang test is significant at 95% right now, against the
pivot I want to make. I am of two minds: on the one hand, I want to bow to
that inevitability. On the other hand, I'm seriously wondering whether it is
the pivot itself causing the disparity or just the implementation details of
the pivot. In a standard A/B test, the change _is_ the implementation detail.
However, given that I had to adjust something like 30 files, I'm wondering
whether customers are really rejecting the pivot or whether they just think
the graphic I made for it is exceptionally hideous. (I started an A/B test for
just the graphic and it is, indeed, getting whupped versus what it replaced.)
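
For concreteness, the significance figure in a test like this is just a
two-proportion z-test on conversion counts. A minimal sketch in Python (the
visitor and conversion counts below are made up for illustration):

    import math

    def two_proportion_z(conv_a, n_a, conv_b, n_b):
        # Pooled conversion rate under the null hypothesis of no difference.
        p = (conv_a + conv_b) / (n_a + n_b)
        se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
        return (conv_b / n_b - conv_a / n_a) / se

    # Hypothetical counts: 5000 visitors per arm, with B converting worse.
    z = two_proportion_z(conv_a=200, n_a=5000, conv_b=160, n_b=5000)
    print(z)  # about -2.15; |z| > 1.96 is significant at the 95% level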

Since the big bang A/B test frequently swaps you onto a whole new mountain and
drops you at its unoptimized bottom, you're left either trying to do
hillclimbing on two hills in parallel (which is OK, as long as you don't mind
your engineering team cutting out your intestines and using them to strangle
you) or doing the deeply unsatisfying "Well, I really feel better about B, so
we're going to hop over there and then start hillclimbing, then pat ourselves
on the back and assure ourselves it was the right decision all along."

~~~
webwright
I'm not sure the big bang is always testable. The question I'd ask myself is,
"How satisfied are we with the velocity of our business, given my goals?" The
goals here are important. One person's goals might be around lifestyle, low-
effort income, etc. Another person's goals might be around big $ liquidity or
getting enough traction/growth to attract angel funding.

If growth/success is not measuring up to your goals, you have two choices.
You can A/B test your way up or you can do something dramatic (hire a
salesguy, pivot your product or positioning, etc). I think a lot of startups
with big dreams and crappy growth dive in with A/B tests when they should be
figuring out how to change how people talk/think about their product(s).

You're right about it being expensive. The scarcest fuel for most teams is
optimism and confidence. One or two failed pivots can be devastating.

~~~
mathewgj
Working backwards from your goals is the best advice I've heard in a while. In
the particular case where you've raised a lot of outside capital, it can be
pretty easy to forget how big you are supposed to get, and how quickly, to put
yourself solidly in the success category.

+1 on dramatic!

------
josefresco
The main issue I battle with in A/B testing is backing out far enough to
know whether an observed change is due to the edits made or simply a
seasonal/daily/hourly/freak fluctuation.

I manage CPC campaigns for clients, and oftentimes I make A/B changes to the
keywords/bids/ads and wait 30-60-90 days before measuring and making changes
based on the findings. Even with 90 days and tens of thousands of visits,
I find that sometimes my scope is too limited and I get nailed by
seasonal/yearly changes that the numbers didn't reveal.

A/B testing in 5K chunks works if everything else in the universe remains
constant, but here in the real world that's never the case.

The worst is when you make a change, measure results, and implement said
change, and then something related to the economy (or even the weather) horks
your CAC and suddenly the client isn't so thrilled with your amazing A/B
improvements.

------
tel
Here's a thought: apply successive A/B testing in steps, similar to the
Metropolis-Hastings algorithm. The core idea is that for each new test (each
new "jump") you evaluate the probability ratio a = p_new/p_old, where p_x is
the probability that x is successful. If a is greater than 1, go with the new;
if it's less than 1, _still_ go with the new with probability a.

By implementing parts of this algorithm you might be able to generate a random
walk that is resistant to local minima.
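
A minimal sketch of that acceptance rule in Python (p_old and p_new stand for
estimated success probabilities; the numbers are illustrative):

    import random

    def accept_new(p_old, p_new):
        # Metropolis rule: always accept an improvement; accept a
        # regression with probability equal to the ratio a = p_new / p_old.
        a = p_new / p_old
        return a >= 1 or random.random() < a

    # Example: a variant converting at 2.8% vs. an incumbent at 3.0% is
    # still adopted about 93% of the time (0.028 / 0.030 ~= 0.93).
    print(accept_new(p_old=0.030, p_new=0.028))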

------
duck
_Instead of just running A versus incremental-change A2, also run a B version
that's radically different from A. Thus you reap the straightforward benefits
of incremental improvements while also searching for something that could
radically improve your revenue._

A very powerful concept that you don't see done very often.
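
One cheap way to run that kind of three-arm split is hash-based bucketing, so
each visitor consistently lands in the same arm. A sketch in Python (the arm
names and traffic weights are illustrative):

    import hashlib

    # Most traffic to the incumbent and its tweak, a small share to the
    # radical variant.
    ARMS = (("A", 0.45), ("A2", 0.45), ("B", 0.10))

    def assign(user_id):
        # Hash the user id into [0, 1) so assignment is stable per user.
        h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
        r = (h % 10000) / 10000.0
        cumulative = 0.0
        for arm, weight in ARMS:
            cumulative += weight
            if r < cumulative:
                return arm
        return ARMS[-1][0]

    print(assign(42))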

~~~
samlittlewood
It is worth looking at "Simulated Annealing":

<http://en.wikipedia.org/wiki/Simulated_annealing>

Start your optimization process with a high "temperature" (leading to big
changes), and let it "cool" over subsequent iterations.
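
A toy version of that schedule in Python (score and perturb are placeholders
for your conversion metric and your variant generator):

    import math
    import random

    def anneal(start, score, perturb, temp=1.0, cooling=0.95, steps=100):
        current = start
        for _ in range(steps):
            candidate = perturb(current, temp)  # big jumps while temp is high
            delta = score(candidate) - score(current)
            # Accept improvements outright; accept regressions with a
            # probability that shrinks as the temperature cools.
            if delta > 0 or random.random() < math.exp(delta / temp):
                current = candidate
            temp *= cooling
        return current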

------
pierrefar
_Instead of just running A versus incremental-change A2, also run a B version
that's radically different from A._

It's like Google testing 40-odd shades of blue to see which performs better,
while failing to discover that the best color to use is (say) red.

------
ellyagg
A few months or a year ago, a Google designer quit when he was forced to split
test one too many shades of blue. He described what a creativity sink that
kind of uniform fidelity to split testing is. Of course, in a large
organization there are always counterexamples and counterpoints, but that
particular Google designer was responding to frustrations inspired by exactly
the observations in this post. Engineering has not yet reached the point where
split testing is always (or even usually, in my opinion) better and more
efficient than the creativity of a trained expert.

~~~
studer
Or that's what he claimed. For some reason, he forgot to mention that he went
from a post-IPO behemoth to a pre-IPO company that was the hottest thing on
earth at that point...

