
20 lines of code that beat A/B testing every time - spiffytech
http://stevehanov.ca/blog/index.php?id=132
======
wisty
A stevehanov.ca link? Wow, HN is getting classy again. Please more articles
with code, equations and/or good visualizations, less upvoting of badly
thought out infographics (i.e. pretty numbers which would lose nothing by just
being presented in a table), and far fewer self-help pseudo-business articles.

+1 on an article does not mean "I agree". It means "I learnt something".

~~~
timr
Bandit optimization has been discussed previously on HN:

<http://news.ycombinator.com/item?id=2831455>

The problem with this article is that it's a) not very detailed, and b) the
conclusion is linkbait. Bandit optimization is a useful tool, but it has
drawbacks, and it's not _always_ better than A/B testing. In particular,
bandit approaches take longer to converge (on average), and don't give you
reliable ways to know when to stop testing (when all you know is that you're
using an approach that's optimal in the limit of a large N, your only
guarantee is that things get better as N gets large). These techniques also
make assumptions that aren't valid for a lot of web experiments: identical
"bandit" distributions that are constant over time. Throw a few choices that
are optimal at different times of day/week/month/year at a bandit optimizer,
and it'll just happily fluctuate between them.

Also, there's a lot of variation in performance depending on the parameters of
your test -- some of which are completely unknowable. So if you want to really
learn about this method, you need to read more than a blog post where the
author has concluded that bandit optimization is the new pink. For example,
here's a pretty readable paper that does an empirical analysis of the various
popular bandit algorithms in different parameterizations:

[https://docs.google.com/viewer?a=v&q=cache:KgmC8CnPhxwJ:...](https://docs.google.com/viewer?a=v&q=cache:KgmC8CnPhxwJ:www.cs.mcgill.ca/~vkules/bandits.pdf+&hl=en&gl=us&pid=bl&srcid=ADGEESjCXEToqqEjlrUIYeKyWtVpgBe4edd4wNeBoFqsTouBIiwPSoAIqP4iZCWXZeygjkzhchKEm6lZYwhCY3RMtg2JrD4Zr9Cge6IO9QH9QZ1Lx25Ee5H8OEHOAb2I0g5Z2NpWwe7P&sig=AHIEtbSqosRQf7c82ZCcUOXYgSo0QJ2IJw)

This is just one article, but there's tons of literature on this problem. (In
fact, if you use the 'softmax' criterion mentioned in that article, you're
doing something very similar to simulated annealing, which is a rather elderly
optimization technique.)

~~~
btilly
The really scary drawback is what happens if the bandit prefers a suboptimal
choice at the same time that you make an independent improvement in your
website. Then the bandit is going to add a lot of data for that variation, all
of which looks really good for reasons that have nothing to do with what it is
supposed to be testing.

This type of error (which can happen very easily on a website going through
continuous improvement) can take a _very_ long time to recover from.

A/B tests do not have an issue with this because all versions will have
similar mixes of data from before and after the improvement.

~~~
3pt14159
I might not be understanding you correctly, but wouldn't the independent
improvement also help the random bandit choices? If you are using a
forgetting factor, this shouldn't be a real issue.

My problem with the bandit method is that I want to show the same test choice
to the same person every time he sees the page so you can hide that there is a
test. If I do this with the bandit algo then it warps the results because
different cohorts have different weightings of the choices and differing
cohorts behave very differently for lots of reasons.

~~~
btilly
The independent improvement also helps the random bandit choices. The problem
is that you are comparing A from a largely new population with Bs that are
mostly from an old population. It takes a long time to accumulate enough new
Bs to resolve this issue.

A forgetting factor will help.

This is a variant of the cohort issue that you're talking about.

The cohort issue that you're talking about raises another interesting problem.
If you have a population of active users, and you want to test per user, you
often will find that your test population ramps up very quickly until most
active users are in, and then slows down. The window where most users are
assigned is a period where you have poor data (you have not collected for
long, users have not necessarily had time to go to final sale).

It seems to me that if you want to use a bandit method in this scenario, you'd
be strongly advised to make your fundamental unit the impression, and not the
user. But then you can't hide the fact that the test is going on. Whether or
not this is acceptable is a business problem, and the answer is not always
going to be yes.

------
btilly
This is thought-provoking, which is good. However there are significant issues
with the approach.

1\. Real world performance varies over time. For instance there are typically
daily, weekly and monthly conversion rate fluctuations. Not an issue for A/B
testing, but a big issue for this approach if a random switch in direction
happens at the same time that conversion fluctuations happen to head in a good
direction.

2\. (This is really a special case of #1 - but a very, very important special
case.) This approach creates long-lasting interaction effects between tests
and independent changes. That requires explanation. Suppose you're running a
test. Version A is somewhat better. But version B is temporarily looking
slightly better when you make a significant improvement to your website (maybe
you started another test that works). Now you're adding a lot of good traffic
to version B (good because of the other change) and very little of the new
good traffic to version A. This new version B traffic soundly beats your old
version A traffic. This correlation between time and website performance will
continue until the old version A traffic is completely swamped by new version
A traffic. With only 5% of your traffic going to version A, this can easily
take 100x as long as your test has been running - or more. (Properly
constructed A/B tests do not suffer this statistical anomaly.)

3\. Code over time gets messy. One of the most important characteristics of
A/B testing is that you can delete the mess and move on. With this approach
you can't - it just hangs around adding to your technical debt.

4\. Businesses are complex, and often have multiple measures they would like
to balance. For instance, in a recent test, conversion to click was hurt,
while conversion to a person who clicked 5x was helped. A/B testing let us notice
that something weird was going on and think about what we really cared about.
This automated approach would make a decision and could have hidden a real
problem.

5\. Many tests perform differently on existing users and new users. A/B
testing with proper cohort analysis can let you tease this out and decide
accordingly. This approach doesn't give you that kind of sophistication.
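
A minimal simulation of the interaction effect in point 2 might look like the sketch below; the conversion rates, seed counts, arm names and 10% exploration rate are illustrative assumptions, not numbers from the article or the thread:

    import random

    # Sketch of point 2: A and B are identical. The test has run a while at a
    # ~5% conversion rate, with B slightly ahead by chance, when a site-wide
    # improvement lifts the true rate to 10% for BOTH versions. The bandit then
    # sends almost all of the "good" new traffic to B, so B's estimate climbs
    # while A's stays stuck near its stale value.
    random.seed(0)
    EPSILON = 0.1
    counts  = {"A": 1000, "B": 1000}   # impressions accumulated before the change
    rewards = {"A": 50,   "B": 55}     # B is ahead purely by chance
    TRUE_RATE = 0.10                   # post-improvement rate for both versions

    def estimate(arm):
        return rewards[arm] / counts[arm]

    for _ in range(5000):
        if random.random() < EPSILON:
            arm = random.choice(["A", "B"])   # explore
        else:
            arm = max(counts, key=estimate)   # exploit the current leader
        counts[arm] += 1
        if random.random() < TRUE_RATE:
            rewards[arm] += 1

    for arm in ("A", "B"):
        print(arm, counts[arm], round(estimate(arm), 3))
    # A typically ends near 0.06 (mostly stale data) while B ends near 0.09, so
    # the bandit "confirms" a difference between two identical versions.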

~~~
pbreit
I don't think any of those points are very true.

~~~
btilly
I've been involved with A/B testing for nearly a decade. I assure you that
none of these points are in the slightest bit hypothetical.

1\. Every kind of lead gen that I have been involved with and thought to
measure has large periodic fluctuations in user behavior. Measure it, people
behave differently on Friday night and Monday morning.

2\. If you're regularly running multiple tests at once, this should be a
potential issue fairly frequently.

3\. If you really fire and forget, then crud will accumulate. To get rid of
that you have to do the same kind of manual evaluation that was supposed to be
the downside of A/B testing.

4\. Most people do not track multiple metrics on every A/B test. If you don't,
you'll never see how much it matters. I make that a standard practice, and
regularly see it. (Most recently, last week. I am not at liberty to discuss
details.)

5\. I first noticed this with email tests. When you change the subject line,
you give an artificial boost to existing users who are curious what this new
email is. New users do not see the subject line as a change. This boost can
easily last long enough for an A/B test to reach significance. I've seen
enough bad changes look good because of this effect that I routinely look at
cohort analysis.

~~~
zader
What do you think of Myna, in these respects? Does it suffer from the same
disadvantages as other bandit optimization approaches?

<http://mynaweb.com/docs/>

~~~
btilly
_Does it suffer from the same disadvantages as other bandit optimization
approaches?_

Yes.

That said, the people there are very smart and are doing something good. But I
would be very cautious about time-dependent automatic optimization on a
website that is undergoing rapid improvement at the same time.

------
rauljara
This is an interesting technique, but it too has flaws. If there is a period
of buzz and excitement surrounding your app, whatever design was most popular
at that time will be rewarded accordingly, and accrue a high click-through
rate with tens of thousands of cases. If you introduce a new, superior design
after the period of buzz has gone away, the new design may take a very long
time to catch up. Even though it is currently outperforming the old, the few
hundred cases that are being added won't be enough to offset the tens of
thousands that came before.

With all metrics, it's important to understand what's actually going into the
measure and where it might get tripped up.

A potential solution might be to add a decay factor, so that the older data
carries less weight.
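
A minimal sketch of that decay idea (the 0.999 factor, the arm names, and the 10% exploration rate are illustrative assumptions, not from the article): multiply every option's running totals by a factor just below 1 on each trial, so data from the buzz period fades and recent performance dominates.

    import random

    DECAY, EPSILON = 0.999, 0.1
    arms = {"old_design": [1.0, 1.0], "new_design": [1.0, 1.0]}  # [rewards, trials]

    def choose():
        if random.random() < EPSILON:
            return random.choice(list(arms))                         # explore
        return max(arms, key=lambda a: arms[a][0] / arms[a][1])      # exploit

    def update(arm, reward):
        for stats in arms.values():   # decay everything; buzz-era data fades
            stats[0] *= DECAY
            stats[1] *= DECAY
        arms[arm][0] += reward        # then credit the arm that was shown
        arms[arm][1] += 1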

~~~
IgorPartola
Better than a forgetting factor, add a Kalman filter
(<http://en.wikipedia.org/wiki/Kalman_filter>). This way you can trust your
"new" data more than really "old" data, etc. The beauty of it is that it only
adds three attributes to each data sample.

~~~
treeface
Could you expound on this a bit? What attributes would you have to add? How
would you calculate scores?

~~~
IgorPartola
You would add a variance (P), an estimate of the value, and the timestamp of
the last measurement. Using the last timestamp you can calculate Q. Generally,
the older the last measurement, the higher Q.

The calculation is straightforward once you let the state-transition terms be
the identity:

    
    
      Pp = P0 + Q
      K = Pp / (Pp + R)
      x1 = x0 + K * (z - x0)
      P1 = (1 - K) * Pp
    

Now you have the new score for your data (x1) and a new variance to store
(P1). Other values are:

x0, P0 - previous score, previous covariance

Pp - predicted covariance

Q - roughly related to the age of the last measurement; goes up with age

R - measurement error; set it close to 0 if you are sure your measurements are
always error-free

z - the most recent measured value

Let's say you measure number of clicks per 1000 impressions. Now you can
estimate the expectation value (x1) for the next 1000. After the second 1000
re-estimate again.
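
As a rough, runnable sketch of that update (the age-based Q, the R value, and the function names are illustrative assumptions, not from the comment): each option stores only an estimate, a variance, and the timestamp of its last measurement.

    import time

    R = 25.0  # measurement noise: how much one batch of 1000 impressions jitters

    def q_from_age(age_seconds):
        # process noise grows with the age of the last measurement,
        # so stale estimates become easier to move
        return 0.01 * age_seconds

    def kalman_update(x0, P0, last_ts, z, now=None):
        now = time.time() if now is None else now
        Pp = P0 + q_from_age(now - last_ts)   # predicted variance
        K = Pp / (Pp + R)                     # Kalman gain
        x1 = x0 + K * (z - x0)                # new estimate (e.g. clicks per 1000)
        P1 = (1 - K) * Pp                     # new variance to store
        return x1, P1, now

    # e.g. previous estimate of 30 clicks/1000 with variance 10, last measured
    # an hour ago, new batch measured at 45 clicks/1000:
    # x1, P1, ts = kalman_update(30.0, 10.0, time.time() - 3600, 45.0)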

~~~
treeface
Thanks for explaining that!

------
equark
This is the most important critique of A/B testing. It far outweighs the
traditional hoopla about simultaneous inference and Bonferroni corrections.

Epsilon-greedy does well on k-armed bandit problems, but in most applications
you can likely do significantly better by customizing the strategy to
individual users. That's a contextual bandit, and there are simple strategies
that do pretty well here too. For instance:

<http://hunch.net/?p=298>

<http://hunch.net/~exploration_learning/main.pdf>

<http://web.mit.edu/hauser/www/Papers/Hauser_Urban_Liberali_Braun_Website_Morphing_May_2008.pdf>

------
noelwelsh
First up, the sales pitch: we provide bandit optimisation SaaS at Myna:
<http://mynaweb.com> Now that that's out of the way, let's discuss the article.

I like the epsilon-greedy algorithm because it's simple to understand and
implement, and easy to extend. However, to claim "The strategy that has been
shown to win out time after time in practical problems is the epsilon-greedy
method" is false. The standard measure of performance is called regret. You
can think of it as the number of times you choose the sub-optimal choice. It
is clear that this grows linearly in e-greedy, as there is a constant
probability of exploring. The same is true in A/B testing (you show 1/2 the
people the suboptimal choice in the data gathering phase and then make a
decision that you have some probability of getting wrong.) A good bandit
algorithm has regret that grows logarithmically with time, which is a huge
difference! This result holds up in practice as well. If you look at some of
Yahoo's papers (John Langford, for example; sorry, no links as I'm writing this
while getting the kids ready!) you'll see comparisons to e-greedy where they
significantly out-perform it. We've had the same results in our testing.
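
To make the contrast concrete, here is a toy sketch of the two selection rules (my illustration, not code from the Yahoo papers; it assumes every arm has already been played at least once): epsilon-greedy explores at a fixed rate forever, while UCB1 adds a confidence bonus that shrinks as an arm accumulates plays, which is what yields logarithmic rather than linear regret.

    import math, random

    def epsilon_greedy(rewards, counts, epsilon=0.1):
        # fixed exploration rate -> regret grows linearly with time
        if random.random() < epsilon:
            return random.randrange(len(counts))
        return max(range(len(counts)), key=lambda a: rewards[a] / counts[a])

    def ucb1(rewards, counts):
        # mean plus a confidence bonus; the bonus shrinks as counts[a] grows,
        # so exploration tapers off and regret grows logarithmically
        total = sum(counts)
        return max(range(len(counts)),
                   key=lambda a: rewards[a] / counts[a]
                                 + math.sqrt(2 * math.log(total) / counts[a]))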

~~~
cbsmith
Yeah, I think the problem here is that trying to be a little bit smart kind of
gets you into the space where really you should be doing things a LOT smarter.
A/B testing provides data that doesn't require much in the way of brains to
interpret and is hard to draw poor conclusions from (beyond treating something
as statistically significant that is not). Once you step off into
epsilon-greedy, you fall into the whole reinforcement learning space.

To that end, btw, I think a service like yours is potentially quite valuable!

~~~
conductrics
Actually, you kind of are already in the RL space when using A/B testing to
make online decisions, you just may not be thinking of it that way. From
Sutton & Barto: "Reinforcement learning is learning what to do--how to map
situations to actions--so as to maximize a numerical reward signal." That is
exactly what you are doing when applying A/B-style hypothesis testing to
inform decisions in an online application. Plus, personally, I think A/B
testing is, in a way, much harder to interpret, or at least most folks
interpret it wrong, which isn't a knock, since it provides a non-intuitive -
at least to me ;) - result.

~~~
wpietri
Maximizing a numerical reward signal is definitely not what we're doing when
we do an A/B test.

We collect a variety of metrics. When we do an A/B test, we look at all of
them as a way of understanding what effect our change has on user behavior and
long-term outcomes.

A particular change may be intended to affect just one metric, but that's in
an all-else-equal way. It's not often the case that our changes affect only
one metric. And that's great, because that gives us hints as to what our next
test should be.

~~~
conductrics
Well, I guess you could be running a MANOVA or something to test over joint
outcomes, but the A/B test is over some sort of metric. I mean, when you set up
an experiment, you need to have defined the dependent variable first. Now,
after you have randomly split your treatment groups you can do post hoc
analysis, which I think is what you are referring to. But if you are
optimizing, there needs to be some metric to optimize over. Of course, at the
end of the day the hypothesis test just tells you prob(data or greater |
null=true), which I am not sure provides a direct path to decision making.

------
steve8918
I actually noticed that Google was doing this with my Adwords account a couple
of weeks ago.

I have 2 ads in an Ad Group, and the wording between the two differs by a
single word. One ad had more than double the clickthrough rate of the other
one, just because of that single word difference.

I noticed that Google was serving the two ads about 50% of the time, and was
going to shut off the one ad that had the lower CTR, but then I let it go, and
the next day, I saw that the more successful ad had almost all the views, and
the less successful one was barely being served.

~~~
TimJRobinson
Yeah, this is an option in the AdWords interface; it is enabled by default but
can be turned off. Google probably discovered that most of their advertisers
don't check on their ad results very often and added the auto-optimization so
that they could show the better ads and make more profit even without any
interaction from the user.

------
robertskmiles
The 'set and forget' aspect of this is appealing. I've sometimes wondered if
you could automate the whole thing, including option generation. If you can
define good enough mutation functions you could have your features literally
evolve over time, without developer input. You'd need a lot of throughput to
get reasonable evolution rates though. Jacking up the mutation rate won't help
because really big mutations will break the layout.

It's almost certainly impracticable, but fun to think about.

~~~
krupan
I'd love to see a website designed entirely by statistical machine learning
:-)

~~~
olefoo
Google leans pretty heavily on machine learning for even trivial design
choices; <http://stopdesign.com/archive/2009/03/20/goodbye-google.html>
mentions the infamous testing of 41 shades of blue to decide which was the
correct one.

But truly someone has to design the learning system. At some level you will
always have design, even if it is only the design of how the system adapts to
its environment.

------
conductrics
This is part of a larger class of problems known as reinforcement learning
problems. A/B testing when used for decision optimization can be thought of
(sort of) as just a form of bandit using an epsilon-first approach. You play
random until some threshold (using some sort of arbitrary hypothesis test),
which is the learning period, then you exploit your knowledge and play the
estimated best option. Epsilon-greedy is nice because it tends to work well
regardless, and isn't completely affected by drift (nonstationarity of the
environment). One heuristic to use for deciding whether to use a bandit
approach is to ask: is the information I will glean perishable or not
perishable? For perishable problems the opportunity cost to learn is quite
high, since you have less time to recoup your investment in learning (reducing
the uncertainty in your estimates). Also, finding the optimal answer in these
situations may be less important than just ensuring that you are playing from
the set of high-performing actions. We have a couple of blog posts on related
issues: <http://www.conductrics.com/blog/>

------
aaronjg
There are appropriate solutions to the multi-armed bandit problem, and a
wealth of literature out there; however, this is not one of those solutions.

Here's a simple thought experiment to show that this will not 'beat A/B
testing every time.' Imagine you have two designs, one has a 100% conversion
rate, one has a 0% conversion rate. Simple A/B testing will allow you to pick
the winning example, whereas this solution is still picking the 0% design
10% of the time.

For some other implementations check out the following links:

For Dynamic Resampling:

<http://jmlr.csail.mit.edu/papers/volume3/auer02a/auer02a.pdf>

<http://www.mynaweb.com>

For Optimal Termination Time:

<http://blog.custora.com/2012/05/a-bayesian-approach-to-ab-testing>

~~~
phildeschaine
OK, in A/B testing that same 0% design, you're showing it 50% of the time.

You seem to be saying "I'll A/B test it just for a little while, then weed out
the 0% one, but in the case of this new algorithm, I'll let it run for a long
time." That's not exactly fair. Not to mention, both algorithms would allow
you to clearly see that the 0% option sucks.

~~~
chc
But the only way this testing method is superior (at least as explained in the
article) is that it automatically adjusts itself. If you're going in and
adjusting manually, it sounds like this is — at best — precisely as reliable
as A/B testing and subject to the same critique the OP levels at A/B testing.

~~~
Vindexus
"But the only way this testing method is superior (at least as explained in
the article) is that it automatically adjusts itself."

That's actually very useful for me though. Especially if a site has a lot of
tests, or I'm running tests for a multitude of clients. It means I have to
babysit the tests less frequently.

------
shadowmint
I like this.

See, we've spent the last year doing an Omniture implementation, using T&T,
and frankly, it's been a waste of money and time.

This isn't a criticism of the T&T tools; it's a criticism of our own internal
analytics handling and analysis team, and possibly the implementation guidance
we received from a few places.

You can guess at generalizations from the A/B/N changes you've made and try
them again, but practically speaking? Meh. It seems like the learnings from
one page are very hard to transfer to another page.

That's why you don't see posts like "Top ten generalizations about how to make
your website better!" Instead you pay "SEO Experts" and "Analytics Ninjas" as
consultants and they tell you things like "pages with images convert more,
generally, but that may not be the case for your specific website because of
blah". Handy. Do you have any more generic and obvious advice that doesn't
tangibly translate into practical changes to my website?

Here's the thing: Running an A/B is easy, but analyzing the results is really
hard. Generating some kind of _general_ analytics rules is _very hard_.

The reason I like this approach is simple: it's easy. It's easy to implement.
It's easy to explain. It's easy to convey to the designers that you're doing a
'throw it at the wall and see what sticks' approach to pages. It's easy to
explain to managers why the best page has been picked. You don't need a
self-important analytics ninja who will go on endlessly about user segments
and how if you segment the data differently you can see kittens in the page
view statistics.

Just make lots of pages, and see what happens.

It's not perfect, but it's enough to get started~ (...and honestly, that's the
most important thing when you're working with analytics; it beats the hell out
of a 3 month implementation cycle that ends up with... page views <-- to be
fair, personal opinion about the uselessness of the implementation not
presently shared by the marketing department, who likes the pretty graphs.
Whatever they mean.)

------
joelthelion
Add the UCT Monte Carlo tree search algorithm, and you have a strong, generic
AI for two-player games such as chess, go, othello, etc.

Here's a C++ implementation: <https://github.com/joelthelion/uct>

------
heyitsnick
Maybe I'm missing it (it's late), but nowhere in the article does it explain
why 10% of the time it picks a choice at random ("explores"). In fact, the
article basically argues why it's not needed (it self-rights if the wrong
choice becomes temporarily dominant). It also doesn't explain why specifically
it should be a 10% randomization.

~~~
daeken
A random choice is needed to allow people to give rewards to options other
than the dominant. I'm sure this doesn't have to be random -- and I'd be
curious to see the logic behind the 10% choice -- but you have to have
something that gives the other options a chance.

Makes me wonder if the 10% number couldn't be changed to something that's a
function of the number of rewards total; the longer it runs, the less
variation there is and the more confident you are in the choice made.

~~~
nerdo
That would be an epsilon-decreasing strategy, or VDBE:

<http://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform_strategies>
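
A tiny sketch of the epsilon-decreasing variant daeken describes (the scale constant and floor are illustrative assumptions): exploration starts high and shrinks as trials accumulate, so a mature test is mostly exploiting.

    def decreasing_epsilon(total_trials, scale=1000.0, floor=0.01):
        # roughly 0.5 after 1,000 trials, ~0.09 after 10,000, never below the floor
        return max(floor, scale / (scale + total_trials))

This value would simply stand in for the fixed 10% in the article's choice step.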

------
conductrics
I think rather than get hung up on e-greedy vs. A/B testing vs UCB (Bayesian
vs. non Bayesian), it is helpful to first step back and think about the larger
problem of online learning as a form of the prediction/control problem. The
joint problem is to 1)LEARN (estimate) the values of possible courses of
action in order to predict outcomes. and 2)CONTROL the application by
selecting the best action for a particular situation.

I noted elsewhere that A/B can be thought of as an epsilon-first learning
approach: play random 100% till p-value < alpha, then play greedy (play the
'winner'). As an aside, it is unclear to me how using p-values is a clearer,
easier, or more efficient decision rule for these types of problems. It is
almost always misinterpreted as Prob(B>A|Data), the choice of alpha determines
the threshold but is arbitrary, and it is often a straw-man default -
implicitly biasing your confusion matrix. Not saying that you won't get good
results, just that it is not clear that it is a dominant approach.

This simple post I wrote on agents and online learning might be informative:
<http://mgershoff.wordpress.com/2011/10/30/intelligent-agents-for-analytics/>

But don't take my word for it (disclaimer: I work for www.conductrics.com,
which provides decision optimization as a service); take a look at a great
intro text on the topic by Sutton & Barto:
<http://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html>

------
yahelc
Seems like what Myna is doing: <http://mynaweb.com/>

~~~
kevinpfab
Conductrics as well: <http://conductrics.com>

------
TimJRobinson
I created a tool a few years ago built on a similar strategy, but instead of
only showing the best performing variation the chance of each variation
showing was based on how well it was converting (so in an a/b/c test with
conversion rates of 3%/2%/1% version a would show 1/2 of the time, version b
would show 1/3 of the time and version c would show 1/6th of the time).

There was one major flaw with this strategy though:

Lets say you're testing a landing page and have had 1000 visitors and version
A is converting at 40% while version B is converting at 30%. So it looks like
so:

Version A - 200 / 500 - 40%
Version B - 150 / 500 - 30%

A new affiliate comes on board and decides to send 200 visitors to your page
from some "Buy 200 visitors for $1" domain redirection service. These visitors
are such low quality that they will never ever buy anything and will probably
just close the window immediately (or are bots). Now your results look
something like this:

Version A - 200 / 680 - 29.4%
Version B - 150 / 520 - 28.8%

And with just 200 visitors some random affiliate has killed all your results.
Now you could add filtering and reports based on the affiliate or traffic
source but this is more code and more attention you have to pay to the test.

If you were running a traditional A/B test your results would look like this:

Version A - 200 / 600 - 33%
Version B - 150 / 600 - 25%

And even though the overall conversion rate is lower you can still see version
A is better than B.

The idea is good and I love the idea of auto-optimization, but it does have
its flaws, which require more than 20 lines of code to overcome.

~~~
conductrics
You might want to look at Boltzmann/softmax if you want to weight the
probability of selection as a function of the current estimated value. One
tricky bit is figuring out a good setting for the temperature parameter.
Another poster alluded to softmax. In my experience it doesn't really perform
better than a simple e-greedy approach, but maybe it has worked well for
others?
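
For reference, a minimal sketch of that Boltzmann/softmax rule (the temperature value is an illustrative assumption): each option is chosen with probability proportional to exp(estimate / tau), so a small tau is nearly greedy and a large tau is nearly uniform.

    import math, random

    def softmax_choose(estimates, tau=0.05):
        # estimates: current conversion-rate estimate per option
        weights = [math.exp(v / tau) for v in estimates]
        return random.choices(range(len(estimates)), weights=weights)[0]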

------
nphrk
There are better approaches for tackling this problem (with zero regret
asymptotically). You can take a look at the UCB (Upper Confidence Bound)
algorithm, and you can do even more if you assume some continuity; e.g. what
is commonly done is to assume that the whole distribution comes from a
Gaussian process. Many interesting ideas in the literature indeed :)

------
6ren
ASIDE: I love the idea of combining this kind of approach with a random site
generator. You can then totally automate the business, from inception onwards.
If incorporated as a company, it's an artificially intelligent artificial
person.

A problem with this (well one of them) is that it could home in on the worst
of spammy techniques, like masquerading as a legitimate message, shaking
animation, illegal claims and "simple" lies/promises. A solution is to filter
these out - one could hand-code specific cases, but a general solution seems
as difficult as AI in the first place, and requires external knowledge (such
as of infrastructure messages, human visual physiology, laws and their
interpretation, the concept of lying). It's hard to gather data (e.g. each
lawsuit); but maybe something along the lines of a simple "complaint" button
would do (use statistics to discount accidental/abusive clicks).

~~~
zawaideh
Something along the lines of giving the bad behavior a huge negative reward
(penalty), and you end up automating that as well :).

------
robomartin
A lot of the reasons for which this algo can optimize in the wrong direction
have been covered here. In general terms, I agree with all of it.

I am a big fan of two techniques that I feel would enhance an approach like
the one suggested: FIR filters and Decay.

Simply put: I believe it is important to have a mechanism through which
decisions are not made too quickly. A finite impulse response filter would
take care of this very well.

In addition to that, older measurements should not carry the same importance
as nice-fresh data. Who cares what people thought about the buttons (or
whatever) a month or two ago? The crowd visiting the site this week could have
been affected by solar flares. Maybe they prefer something else.

Obviously you need enough traffic to be able to use such techniques.

Not sure it's worth the complexity in all cases.

------
micheljansen
I can imagine this works very well for e-commerce websites and other things
where there is a very obvious single measure of success, for example:

    
    
      * the user clicked the button
      * the user signed up
      * the user put something in their cart
      * the user paid X
    

In this case, it's easy to create the feedback loop that is required for this
testing method.

However, in the real world, things are not always that simple. What if you
want to optimise:

    
    
      * the percentage of users that returns to the site 
      * the time that users spend on your site
      * the number of pages that they view in a session
    

I'm sure some of these metrics can also be plugged back into the bandit
algorithm, but it's a lot more complicated.

------
toemetoch
Next step: close the loop.

Option 1: Let the next set of values (colors of the button in the example)
themselves be generated after n runs.

Option 2: Let users modify the CSS as part of a "customization" feature. I
remember they did this before the new BBC site was officially launched a few
years ago.

------
K2h
This epsilon-greedy [1] thing looks just like what my old boss used to tell
me: 'trust but verify'. Ohh... I got so sick of those words, but at least now
I have an algorithm for it.

Just change epsilon = 0.1 (10%) higher or lower depending on your initial
(personal) confidence. If your guess was right and your epsilon low, then the
overall impact on the 'optimal' solution is negligible, but you have built in
a fail-safe in case you were human after all.

[1] <https://en.wikipedia.org/wiki/Multi-armed_bandit>

------
chargrilled
Bayesian bandits is also a very interesting approach:
<http://tdunning.blogspot.co.uk/2012/02/bayesian-bandits.html>

------
conductrics
Of course, if the environment is truly stationary, then the easiest, simplest
hack for exploration is to just seed the initial values for each option
(A/B../Z) with an optimistic guess (something you know is higher than the true
value). Then just make decisions based on the current best estimate. The
estimates will be driven down over time to their true values. Not claiming you
should do this or that it is optimal or anything, but keep it in mind as a
quick hack to solve the problem.
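
A sketch of that optimistic-initialization hack (the 1.0 seed value and the pseudo-count treatment are illustrative assumptions for a conversion-rate metric): every option starts out looking better than it can possibly be, so even a purely greedy rule is forced to try each one until its estimate is pulled down toward reality.

    estimates = {"A": 1.0, "B": 1.0, "C": 1.0}  # seeded above any plausible true rate
    counts    = {"A": 1,   "B": 1,   "C": 1}    # the optimistic guess counts as one observation

    def update(arm, reward):
        counts[arm] += 1
        # running mean: the optimistic seed is averaged away as real data arrives
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    def choose():
        return max(estimates, key=estimates.get)  # greedy only; no random exploration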

------
Drbble
It's funny to see the startup crowd rediscovering age-old techniques... Seen
the Amazon.com home page recently? Does it look the same every day for a
month?

------
sravfeyn
I can draw many parallels between this and genetic algorithms. There is a
probability of choosing the next choice (in GA, children), and we have the
highest probability for the most profitable (in GA, fittest) choices. The
solutions evolve, and the most profitable (fittest, in GA) solutions remain.

How is this different from Genetic Algorithms?

~~~
conductrics
A GA is a zeroth-order optimization method. A bandit is a type of decision
problem. So, a bandit is a single-state RL problem where one is trying to make
decisions in an environment in order to minimize regret. A GA is a general
optimization approach for when there is no gradient or second-order info about
the problem to use. Take a look at XCS classifiers for an approach that can
solve bandit-type problems, but uses GAs to estimate the mappings between
features and rewards.

~~~
sravfeyn
Thank you.

------
viggity
what?!? No patio11 comment?

~~~
patio11
I upvoted btilly's at the top of the thread, considered saying "FWIW: my
opinion is _this_ , particularly #2", and then decided that added very little
value.

------
loceng
This data isn't fully valuable unless you know what alternate behaviour the
users are performing (instead of clicking the alternate-coloured buttons).
They still could be staying on your site, just not following the same funnel.
They could be return visitors as well.

~~~
DivisibleByZero
The example highlights the reward of button pressing, but most likely you
would be measuring your "success" differently in your own application. You
could add different behavior categories and weight each accordingly to how
much it is worth to your product.

~~~
loceng
Right. If you totaled the importance values then you could see a nice funnel.

------
ArekDymalski
There's one thing that keeps me concerned: time (actually, the number of
responses required to pick out the best option). Please correct me if I'm
wrong, but I've got a feeling that this process requires more displays to
effectively determine 'the best' option.

~~~
ComputerGuru
That's actually easy enough to fix. You trade off between certainty and time
by setting the initial confidence values.

For instance, if you're testing A, B, and C; you can start off with
success/total values of 1/1 for each or 100/100 (for extreme values). If you
start off with 1/1, a single hit or a single miss will swing the algorithm
quickly and heavily in that particular direction; e.g. 1 miss for C results in
1/2, and brings its success rate down from 100% to 50% immediately, giving
instant precedence to A and B. Whereas if you used 100/100 to start, a single
miss for C would only bring it down to 100/101, letting the algorithm take
much longer to "settle," but with far more confidence.

The trick is in picking a number that suits your needs, e.g. for expensive
traffic sources (AdWords) pick smaller numbers to minimize the cost of the
experiment, and for cheaper, higher-volume sources use larger numbers because
you can afford the extra time to be sure.
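
To see the trade-off in numbers (these are the commenter's hypothetical seed values, not measured data): the larger the seed counts, the less a single miss moves an option's observed success rate.

    def rate_after_one_miss(seed_successes, seed_trials):
        # one extra impression with no conversion
        return seed_successes / (seed_trials + 1)

    print(rate_after_one_miss(1, 1))      # 0.5   -- a 1/1 seed swings immediately
    print(rate_after_one_miss(100, 100))  # ~0.99 -- a 100/100 seed barely moves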

------
blauwbilgorgel
<http://genetify.com/demo/> is an interesting related js solution.

<https://github.com/gregdingle/genetify/wiki/>

------
perssontm
This sounds really interesting; I might have been close to building this
before without realizing it. :) Seems really easy to set up as well; it will
be interesting to hear any counter-arguments in the comments.

~~~
thenomad
Seconded. As a non-mathsy marketing/biz guy, my response was "whoa, sounds
great.", but with the caveat that my response to the actual statistical math
side was "Derp."

I'm really looking forward to hearing comments from people With Actual Maths!

------
wildmXranat
I've made it a habit to read his articles on a regular basis. There's always
something thought-provoking that sticks for a long time.

------
theDoug
There's a lot out there beyond A/B testing; most of what people are looking
for tends to be multivariate.

------
wdewind
This does require a significantly larger sample size for accuracy though.

~~~
TomGullen
A larger sample size than A/B? Why?

~~~
daeken
I don't know for sure, but it (intuitively) seems to me that you'd require a
larger sample size because only 10% of the time will it explore other options.
With A/B testing, you've got equal odds to hit either option, so each step is
independent of the last; with this, each step depends on the previous results
and thus if you have one option jump out into the lead (for whatever reason)
it'd make it less likely that the other gets a good sample.

------
rwhitman
Does anyone know of any split test products on the market that do this?

------
iworkforthem
Does anyone have a JS implementation of the multi-armed bandit algorithm?

------
tzaman
Can I get this in Ruby, please? :)

~~~
dremmettbrown
<https://github.com/bmuller/bandit>

~~~
duaneb
Great! Now how about COBOL?

------
its_so_on
I always wondered why the great people I know can do so much better without
having to do A/B testing in their own businesses. Sure, they try new things,
but they are most certainly not applying an A/B-type algorithm.

It seems the article mentions something quite important in its algorithm:
mostly do what has the highest expected value. The people who are great just
have a much better way to judge that. They can make a poster with 20 design choices
(or 50 or 100) and make an estimate of what would probably work and what
probably wouldn't on each one, from the size of poster, to the font sizes and
types, whitespace, where to place different elements, graphics choices, etc
etc etc. They certainly include a random aspect, but this is the exception
rather than being the norm. Mostly what dictates choices is your expected
returns on them, and the random aspects are compared with these.

They do pay exquisite attention to their random choices: "This week I decided
to see what would happen if I mixed up the day, date, location, and
description rather than have it be in logical order, to see if this engaged
people any more" (or: to change any other choice randomly). But it is the
exception rather than the norm, and done rarely rather than often. They still
pay attention to the results, which helps inform their "expected value"
function.

(I didn't spend much time on the article or the linked papers, so please feel
free to correct me if I'm misinterpreting. A rigorous algorithm doesn't have
much to do with real-world choices, and we are simply nowhere near having an
automated web service to write your copy, _regardless_ of how many users get
to give feedback on it. So the whole thing isn't very interesting to me, and
the above is just my impression of 'why'.)

------
mayop100
Nothing new here. This is a common technique. Any email newsletter service
worth its salt has been doing this for years, and I think many A/B testing
tools have as well. A/B testing doesn't mean you'll assign 50% of users to
each version.

------
eta_carinae
There is no evidence in this article that

1) this approach is better than A/B testing.

2) it doesn't suffer from the Heisenberg principle.

Also, the call out box at the top is obviously an ad but it's not marked
"Sponsored link", which is really scammy.

~~~
learc83
>Also, the call out box at the top is obviously an ad but it's not marked
"Sponsored link", which is really scammy.

That's because it's a link to a webapp the author built.

