
Stop A/B Testing And Make Out Like A Bandit - zackzackzack
http://untyped.com/untyping/2011/02/11/stop-ab-testing-and-make-out-like-a-bandit/
======
btilly
This is an interesting suggestion, but I have significant questions that I'd
like to see answered.

1\. Most multi-armed bandit algorithms assume that the potential reward for
each lever is the same each time you pull it. Unfortunately web traffic does
not look like this - there are daily, weekly, and monthly cycles in conversion
characteristics, with large random fluctuations on top. An A/B test can ignore
this - who got the better traffic is just another random factor that comes out
in the statistics. How much would this impact a multi-armed bandit approach?

2\. My understanding is that multi-armed bandit algorithms assume that
feedback is instantaneous - pull the lever and get the answer. But this is
often not true. You send a whole batch of emails before getting feedback on
the first. Depending on the business, incoming users can take time to convert
to paying customer. I've seen places where the average time to do so was
weeks. What decision should be made in that time period of uncertainty? Worse
yet, what if one version speeds up conversions relative to the other? I know
how to tweak an A/B test to handle this issue (just wait then look at cohorts
that should have converted under either), I don't know how to modify a multi-
armed bandit algorithm to do so.

3\. As a practical matter, companies don't want to keep tests going
indefinitely. There is a real technical cost to maintaining a complicated mix
of possible pages that can be shown. You want losers to be removed from your
code base. A/B testing is well-suited to doing that. A multi-armed bandit
approach is not.

4\. Most companies don't even do A/B testing correctly. I fear that pushing a
more complex scheme makes them less likely to try it, and increases the odds
of mistakes.

~~~
noelwelsh
Good questions. I'll do my best to answer them.

1\. We assume that the distribution of rewards for each lever is fixed (over
the short term). This allows the reward to vary randomly so long as the
average reward (over the short term -- days to weeks) is constant. There are
more complex schemes, which allow for greater variation in reward, but the
initial Myna offering is intended to be directly comparable to A/B testing.

2\. It's not necessary to assume that feedback is instantaneous. Basically you
can continue to make suggestions (pull levers) in proportion to your best
estimate of their expected return and the maths holds. Very long conversion
cycles will cause problems for any system, I think, as you'll spend a long
time in a random walk. In these cases we recommend using a proxy measure which
is correlated with conversion, if one is available. As for one option speeding
up conversion, I don't think that will matter as you'll simply refine your
estimate of one lever faster. I haven't thought too much about this particular
issue; it would be worth doing some simulations to see.

3\. Just turn off the bandit when you're satisfied with the results. That is,
you can use Myna like A/B testing (the Myna UI displays confidence bounds for
this very reason) while still getting the benefits of optimising as data
arrives.

4\. I actually like bandit algs as there is less for the user to mess up. You
don't have to worry about how much data to collect, what p value to use, and
so on. Just set it running and it optimises automatically.

~~~
btilly
Further questions.

1\. My experience is that this assumption can be significantly mistaken. I
have seen significant daily and weekly variations in conversion rates. (They
average out after a bit, but they fluctuate.) Still a small tweak could help.

2\. Please do look into this one; the difference is not small. I first
encountered the differential timing issue with a test of the benefits of
adding a phone touch point on top of an email cycle. That extra contact moved
conversions up by weeks. So after the first month it looked fantastic, but
then slowly degraded over time. (There was improvement, but not enough to
justify the expense.) A careful cohort analysis showed that it was not
worthwhile at a point where a naive A/B test was still showing the extra
touchpoint winning by a very significant margin.

3\. That's reasonable.

4\. In practice during A/B testing people don't worry about those things
either. They just start the test, and later declare a winner. (People are much
more sloppy about it than they theoretically should be...)

------
wpietri
I'm probably just missing something. But for us the purpose of A/B tests isn't
really optimization; it's learning. We have a hypothesis about how to improve
something and we try it out. The most valuable tests for us are the ones that
don't work, because they force us to go back and think things through again.

A magic multivariable optimizer seems fine for the kinds of things a human
won't be thinking about (e.g., most interesting tweets this hour and the best
ads to show next to them). But from this article I'm not seeing an advantage
in using a similar mechanism for testing product hypotheses.

------
nkh
This article needs a patio11 bat signal.

I would be very interested to get his thoughts on this, and he didn't comment
on it last time it was posted.

~~~
patio11
Looks nice. Happy to see more optimization in the world. Have not dug into the
math of it enough to appreciate that aspect yea or nay. Think claims of
superiority over A/B testing are moot unless it successfully fixes the biggest
problem with A/B testing, which is that people don't A/B test. Don't feel
burning need to implement for myself.

How's that?

------
judofyr
8 months ago: <http://news.ycombinator.com/item?id=2831455> (65 comments)

~~~
noelwelsh
Well, maybe we can add to that discussion.

I say that as author of the blog post, and the founder of Myna, which is an
implementation of the ideas described therein. And yes, I'm totally hoping
this post stays on the front-page so we get more hits. (Myna is:
<http://mynaweb.com/>)

Now that Myna is out (though still in beta) I'm super interested in discussing
it with anyone who is interested.

~~~
swah
Did you choose the bird image on your front page using Myna? (I'm asking
because normally we see successful people in that place and supposedly it
converts well.)

~~~
josscrowcroft
I dunno why, but I do feel a Content Optimisation tool's homepage could be a
little more ... content optimised.

Having said that, loved the article, fascinating stuff!

~~~
noelwelsh
I agree. It is kinda embarrassing, but we have limited time and have to focus
where we think we'll get the most bang-for-buck.

------
cschmidt
I know about the Gittins index to solve Multi-armed bandit problems, from
1979.

<http://en.wikipedia.org/wiki/Gittins_index>

I thought that was supposed to be optimal in some sense. How does the paper
cited in this post improve on that?

~~~
noelwelsh
The kind of optimality we're talking about is the up-to-constant factors /
Big-O kind. The Gittins index has better constant factors than UCB-1. However,
UCB-1 can be computed easily whereas Gittins indices are very expensive to
compute.
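To illustrate the "computed easily" point, here's a minimal sketch of the UCB-1 rule (the arm counts and rewards here are made up for illustration):

```python
import math

def ucb1_choice(counts, rewards):
    """UCB-1: play the arm with the highest empirical mean reward plus
    an exploration bonus that shrinks as the arm is sampled more."""
    # First, play any arm that hasn't been tried yet.
    for i, c in enumerate(counts):
        if c == 0:
            return i
    n = sum(counts)  # total pulls so far
    def score(i):
        return rewards[i] / counts[i] + math.sqrt(2 * math.log(n) / counts[i])
    return max(range(len(counts)), key=score)

# Two arms with 10 pulls each and total rewards 2 and 5: the
# exploration bonuses are equal, so the higher mean (arm 1) wins.
print(ucb1_choice([10, 10], [2.0, 5.0]))  # prints 1
```

That's a handful of arithmetic operations per decision, whereas computing a Gittins index requires solving a nontrivial dynamic programming problem per arm.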

------
wccrawford
He'd lose that bet about the supermarket. Produce is in the far back corner,
the bakery is in the front corner near the entrance, and the dairy is on the
other side of the store in the back. 1 out of 3. I win.

But I doubt that it's laid out that way to make people walk all the way across
the store. It's that way because there's only so many walls, and those
departments need space that the customers can't enter. Meat and seafood are
along the rest of the wall at the back for the same reason.

------
zachallia
did you a/b test that default styled sign up button?

~~~
noelwelsh
Zing! :)

No, we haven't. We're not in a growth stage yet. On a normal day (this is not
a normal day) Myna doesn't get very much traffic. We've already validated the
concept with earlier beta testers and are now refining the offering. Once
we've done that we'll be trying to drum up more traffic and start optimising.

~~~
zachallia
haha sweet, i figured. was just being an ass.

