
Why multi-armed bandit algorithm is not "better" than A/B testing - paraschopra
http://visualwebsiteoptimizer.com/split-testing-blog/multi-armed-bandit-algorithm/
======
rmcclellan
Multi-armed bandit isn't an algorithm, it's a model of how to view the
problem. Like it or not, the problem web designers face fits the multi-armed
bandit model pretty well. The algorithm called "MAB" in the article is one of
many that have been developed for multi-armed bandit problems. Traditionally,
the "MAB" of this article is known as "epsilon-greedy".

The point of multi-armed bandit situations is that there is a trade-off to be
made between gaining new knowledge and exploiting existing knowledge. This
comes up in your charts - the "MAB"s always have better conversion rates,
because they balance between the two modes. The "A/B testing" strategies
always gain more information quickly because they ignore exploitation and only
focus on exploration.

I should also say that multi-armed bandit algorithms aren't supposed to
be run as a temporary "campaign" - they are "set it and forget it". In
epsilon-greedy, you never stop exploring, even after the campaign is over. In
this way, you don't need to achieve "statistical significance" because you're
never taking the risk of choosing one path for all time. In traditional A/B
testing, there's always the risk of picking the wrong choice.
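
For concreteness, here is a minimal sketch of the epsilon-greedy strategy being
discussed (the names and the 10% exploration rate are illustrative, not taken
from the article's code):

```python
import random

def epsilon_greedy_choose(conversions, impressions, epsilon=0.1):
    """Pick which variation to show to the next visitor.

    conversions[i] / impressions[i] is the observed conversion rate of
    variation i so far; epsilon is the fraction of traffic spent exploring.
    """
    if random.random() < epsilon or sum(impressions) == 0:
        # Explore: show a uniformly random variation.
        return random.randrange(len(impressions))
    # Exploit: show the variation with the best observed rate so far.
    rates = [c / n if n else 0.0 for c, n in zip(conversions, impressions)]
    return max(range(len(rates)), key=lambda i: rates[i])
```

Because epsilon never goes to zero, a fixed fraction of traffic keeps exploring
forever, which is the "set it and forget it" behaviour described above.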

You aren't comparing A/B testing to a multi-armed bandit algorithm because
_both_ are multi-armed bandit algorithms. You're in a bandit situation either
way. The strategy you were already using for your A/B tests is a different
common bandit strategy called "epsilon-first" by wikipedia, and there is a bit
of literature on how it compares to epsilon-greedy.

<http://en.wikipedia.org/wiki/Multi-armed_bandit#Common_bandit_strategies>

~~~
KaoruAoiShiho
This comment just sold me on MAB. You can just keep on throwing variations on
a design at the system without having to make tenuous decisions. I hope all
the A/B tools implement this soon.

~~~
noelwelsh
Don't wait: <http://mynaweb.com/> [Yes, this is a shameless plug for my
startup.]

~~~
arkitaip
It seems like most people who use these content optimization tools don't
really understand the statistics involved. What are your thoughts on this? How
do you educate your users on the merit of your approach vs a/b testing when
the topic is so complex?

Also, despite this being a slightly pro-A/B-testing post, I have to say it's
actually made me more interested in trying out Myna's MAB approach.

~~~
Drbble
Same way as every product from GWO and T&T on down: show a pretty graph that
ignores the underlying assumption that it's even possible to use statistics to
conjure certainty from uncertainty, and trust that users will never know or
care about the difference.

/former AB test software dev who fought to stop my users from misinterpreting
results, and failed.

~~~
btilly
If it gives you comfort, if there is a significant underlying difference and
the calculations are done right, with high probability they will get the right
answer even though they are misunderstanding the statistics.

Acceptance of this fact has avoided a lot of potential ulcers for me.

------
jerf
I'd love to see someone who knows more explore this, but: Statistical tests
and processes can not be blindly applied. They all have prerequisites before
they mean anything. A common one is the assumption that the underlying
distribution is Gaussian. In that case, average + standard deviation tells you
something very useful. On the other hand, nothing stops you from taking the
average and standard deviation of a Poisson process, but the numbers will be
much less meaningful, and often quite deceptive.

I'm concerned that the assumptions that are necessary for those statistical
tests are not being met. Generally the significance tests are based on an
assumption that all samples are independent as far as I know, but MAB grossly
violates that assumption, and I do mean grossly. MAB doesn't pass those tests
because it is, itself, a statistical significance test in some sense, and it
is "deliberately" not feeding the independence-base significance algorithms
the data it thinks it should be getting. This is not a bug, it's a feature.

(Pardon the anthropomorphizing there. It's still sometimes the best way to say
something quickly in English.)

In fact, the very fact that you agree MAB has a higher conversion rate, which
in this context means nothing more and nothing less than that it _works
better_, but that there's this measure Q on which it does worse, is probably
better interpreted as evidence that Q is not a useful measurement, rather than
as evidence that the thing that _works better_ shouldn't be used due to its
lack of Q-ness.

~~~
brockf
In this case, because you are measuring conversion vs. no-conversion, the
underlying distribution would actually be binomial. With a large enough sample
size, it looks similar to a Gaussian distribution, but still slightly
different.

Also, I don't see the problem of lack of independence. People are still
randomly assigned to one of the conditions in the MAB scheme, people in one
sample don't affect people in the other sample, and each person submits one
independent datapoint. The problem is just one of unequal sample sizes in your
two conditions, so when you make the comparison (in a one factor ANOVA, say),
you lose lots of statistical power because your effective sample size is
essentially the harmonic mean of the two sample sizes (which is heavily biased
towards the lower number of the two).
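
For a rough sense of the size of that effect, here is the harmonic-mean
approximation applied to a made-up 90/10 split (the numbers and the function
name are just illustrative):

```python
def effective_n(n1, n2):
    """Harmonic mean of the two group sizes: roughly the per-group sample
    size an equal split would need to give the same standard error for the
    difference between the groups."""
    return 2 * n1 * n2 / (n1 + n2)

print(effective_n(9000, 1000))  # 1800.0 -- a 90/10 split of 10,000 visitors
print(effective_n(5000, 5000))  # 5000.0 -- a 50/50 split of the same traffic
```

So a heavily skewed allocation behaves, for power purposes, like a much smaller
balanced test.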

~~~
jerf
"People are still randomly assigned to one of the conditions in the MAB
scheme"

No, they're only _partially_ randomly assigned to one of the conditions. MAB
has a memory based on what the results of the previous runs are, which is
basically another way of phrasing that there's a dependence.

~~~
brockf
I'm not so sure. By that logic, no experiment has random assignment. I know
when I place subject X in condition Y that I do so because I need Z people in
condition Y. Thus, I have some information about where I've placed other
people. This is no different?

~~~
timr
The point is that with bandit models, you're placing people in a cohort as a
function of the previous conversion rate of your cohorts. That's the
dependence.

By contrast, randomly allocating X% of your population to cohort C_i is not a
dependence on anything but the free parameter (X).

~~~
Estragon
The results will still be conditionally independent given the assignments,
which is all you need to use the counts for estimation of a binomial.

~~~
timr
_"The results will still be conditionally independent given the assignments"_

If the assignments are the source of the problem, you can't just take them as
"given" and assume that things are OK.

A simpler example: if I assign people to cohorts based on their age, then the
results of my experiment may be "conditionally independent" with respect to
(say) eye color, but it probably won't be conditionally independent with
respect to income or reading level or pant size. In other words, _the
assumption of independence is violated with respect to all variables
correlated with age._

With bandit optimization, our bucket allocation scheme is correlated with
time, which makes it nearly impossible to apply conventional statistical tests
on any metric that may also be correlated with time. And in web testing,
that's nearly everything.

------
noelwelsh
When I see this "How soon statistical significance (of p-value < 0.05) was
detected (in a two tailed Z-test)", I think of
<http://www.evanmiller.org/how-not-to-run-an-ab-test.html>

tldr: The methodology is flawed.
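
To make the tldr concrete, here is a toy A/A simulation of the peeking problem
(all parameters are invented; it just checks a two-proportion z-test at
repeated interim looks on two identical variations):

```python
import math
import random

def peeking_false_positive_rate(trials=1000, n=5000, p=0.05, check_every=100):
    """Fraction of A/A tests (identical variations) that cross |z| > 1.96 at
    *some* interim look, i.e. would be stopped early as 'significant'."""
    random.seed(0)
    hits = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        for i in range(1, n + 1):
            conv_a += random.random() < p
            conv_b += random.random() < p
            if i % check_every == 0:
                pooled = (conv_a + conv_b) / (2 * i)
                se = math.sqrt(pooled * (1 - pooled) * 2 / i)
                if se > 0 and abs(conv_a - conv_b) / i / se > 1.96:
                    hits += 1
                    break
    return hits / trials

print(peeking_false_positive_rate())  # far above the nominal 5% error rate
```

There is no real difference between the arms, yet stopping at the first
significant look flags a "winner" far more than 5% of the time.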

~~~
paraschopra
It's not flawed. Wouldn't you assume that whatever bias (due to "peeking" at
results) was there for one algorithm would also be there for another? Also, to
offset this effect somewhat, I waited till I saw statistical significance at
least 10 times. (I know it's a heuristic) Moreover, I'm curious if there's a
theoretical result that compares statistical power of randomization and MAB.

Though like a commenter said, comparing these two algorithms is actually like
comparing apples to oranges, and that was precisely the conclusion.

~~~
rfergie
All you are saying is here is that you got it wrong in the same way for all
parts of the test.

Which may remove the bias (I'm not sure about this) but it doesn't inspire
confidence.

~~~
paraschopra
Yes, I agree it is not a rigorous statistical analysis but I tried doing a
fair comparison (based on assumptions/heuristics clearly described in the
post). The blog post simply claims what is described there.

------
darkxanthos
One glaring flaw in this article is that they're comparing a multi-arm bandit
with a 10% exploration rate to a simple random split test with an implicit 50%
exploration rate. Of course the split test will converge faster.

In their results when they compare a MAB with a 50% exploration rate to their
split test you start to see a comparable amount of time to converge. Also they
only show the results of one simulation with a lot of randomness in it. Given
we're all stats nerds it would've been handy to see a box plot for each of the
styles of simulation across multiple runs.

Having said all of that, my biggest concern around MAB as it's being sold is
the lack of thought around the experiments. In the end it's just data and it
doesn't mean anything without human intuition and preconceptions guiding it.
Example: day of week, time of year/day, a new page people are still getting
used to, an old one that garners lots of repeat traffic, etc. When running
tests there's a lot more to consider besides just "whatever the data says".

~~~
paraschopra
We ran each simulation 25 times (with 10000 iterations each), and can share the
exact statistics in case you are interested. And yes, MAB with a 50%
exploration rate did perform similarly, but that didn't generalize when there
were more than two versions.

------
jasonwatkinspdx
Experiment and Optimization are different topics and scoring one by the
criteria of the other will lead to mistaken conclusions.

Your simulation is a fiction. It only informs you of the behavior of the code
you wrote. To connect statistical significance to reality requires honoring
criteria you have not met.

Applying frequentist evaluation concepts to Bayesian and Markov learning
methods will lead to mistaken conclusions because they follow from a different
set of axioms. They both lead to numbers, but these numbers are not
interchangeable and you must remember their qualifications when applying them
to predicting reality.

In more frequentist terms, multi-armed bandit is searching for minimum total
regret via a random walk, not rapid convergence to a significance predicate.
It can do this in the face of changing conditions and weaker assumptions
concerning control than frequentist significance requires, which is why such
methods are now the norm in machine learning.

There are huge opportunities that you will not see with your current
perspective. If you do not learn them, your competitors will.

I do not know of a particular reference for multi-armed bandit algorithms
specifically, but they are a specialized case of Markov model learning described
with vocabulary that predates a more modern broad and uniform understanding of
these topics.

David Barber's recent book is very good.

If you want a broad understanding of common mistakes due to unstated
metaphysical assumptions concerning statistics, experiment and action, read
Judea Pearl's 2nd edition.

------
tel
Convergence to statistical significance is a bad metric (just like how early
stopping is such an issue) but you can perhaps map it to the rigorous parameter
of statistical power. From this, it's well proven that t-tests are
statistically most powerful--there exists nothing better--given that you're
simulating disjoint normals.

In reality, however, tests like the Wilcoxon test are 99% as powerful and more
robust to misspecification in models.

I bring this up because while statistical power and significance are very
important metrics in the theory of picking good tests, they're actually pretty
terrible ones in practice. Comparing MAB, which optimizes an entirely different
loss parameter, to t/z-tests on power is sort of meaningless.

MAB can produce a cleaner workflow for many kinds of websites. Underperforming
classes will be underrepresented and eventually pruned. The increased power of
a batch test isn't necessarily so important in this context. I'm not even
actually advocating MAB over other tests, just that you shouldn't spend too
much time worrying about power comparisons unless you're genuinely comparing
apples to apples.

~~~
paraschopra
Yes, comparing MAB with A/B test is not accurate and that is what the
conclusion (in summary) is. These algorithms optimize for different things,
and our customers should know that.

------
birken
This blog post misses what is, to me, the most important difference: if you are
not keeping consistent assignment ratios of users into experiment buckets,
then you are implicitly making the assumption that no factors other than
your A/B test can affect conversion.

A very common example would be you might have a form that converts higher on
the weekends than the weekdays just because of some external factor. You fire
up a test with this "multi-arm bandit method" on Friday morning with two
variations that are exactly equal. By random chance variation A starts winning
by the end of Friday, and then it gets shifted to 90% of impressions. Due to
this external factor, that it is now the weekend and that naturally increases
conversions, variation A's conversion rate is going to excel past that of
variation B, maybe even significantly. However, it in fact isn't better at
all.
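
A toy version of that scenario, just to illustrate the mechanism (the numbers
and the simplified 90/10 allocation rule are invented, not the article's code):

```python
import random

def weekend_bias_demo(seed=1):
    """Two *identical* variations. Friday: 50/50 split at a 5% base rate.
    Then the Friday 'leader' gets 90% of impressions over the weekend,
    when the base rate jumps to 10% for everyone."""
    random.seed(seed)
    conv, imps = [0, 0], [0, 0]

    # Friday: even split, weekday base rate.
    for _ in range(2000):
        arm = random.randrange(2)
        imps[arm] += 1
        conv[arm] += random.random() < 0.05

    leader = max((0, 1), key=lambda i: conv[i] / imps[i])  # ahead by chance

    # Saturday and Sunday: 90% of traffic to the leader, weekend base rate.
    for _ in range(4000):
        arm = leader if random.random() < 0.9 else 1 - leader
        imps[arm] += 1
        conv[arm] += random.random() < 0.10

    return {"leader": leader,
            "rates": [round(c / n, 3) for c, n in zip(conv, imps)],
            "impressions": imps}

print(weekend_bias_demo())
# The leader's overall conversion rate ends up noticeably higher than the
# other variation's, even though the two variations are exactly equal.
```

The observed gap comes entirely from the time-varying base rate interacting
with the unequal allocation, exactly as described above.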

Now this may be uncommon, but think about this: Your website is changing all
the time. If you happen to make a change on your website that affects
conversion rates and you do not keep consistent bucket assignment ratios, then
any ongoing experiment is going to be tainted.

I appreciate that the "multi-armed bandit" method increases the conversion
rates on your site during the test, but to be honest when I am running a
test I'm not worried about increasing the conversion rate during the test. I'm
worried about increasing the conversion rates in the much longer future, and
the best way to accurately do that is to get accurate and significant results
from my test as quickly as possible.

This also debunks one of the advantages in the blog post that says you can add
variations at any time. You _cannot_ add variations at any time, because if you
do you are comparing apples to oranges. The conversion rates from time period
A-->C cannot be assumed to be the same as those from B-->C, because there
likely have been changes in between A and B that could affect the "base"
conversion rates (like website changes, or external factors like weekend vs.
weekday). When you add a new variation, you are creating a new A/B test that
must be analyzed independently.

~~~
btilly
That was exactly my point at <http://news.ycombinator.com/item?id=4040022> and
it is a valid one.

However I have been thinking about it since, and it is possible to design a
multi-armed bandit approach with logarithmic regret (though higher by a
constant factor than a traditional approach), that can handle the varying
performance ratio. It also would allow you to add variations at any time.

There remain operational differences, but this problem is fixable.

------
zader
Helpfully, the final paragraph of the article seems to contain its takeaway
message:

"So, comparing A/B testing and multi-armed bandit algorithms head to head is
wrong because they are clearly meant for different purposes. A/B testing is
meant for strict experiments where focus is on statistical significance,
whereas multi-armed bandit algorithms are meant for continuous optimization
where focus is on maintaining higher average conversion rate."

Like any business owner, at the end of the day I care more about conversion
rates than about statistical significance.

I'm not a scientist trying to advance the sum of humanity's knowledge. I'm a
business owner trying to find the shortest path between what my customers need
and what I can profitably offer them.

In a way, statistical significance strikes me as a bit of a fool's errand,
because significant results in one context may not be generalizable to
another, which means even if we know for a near certainty what worked best,
it's hard to apply that knowledge reliably in the future.

Of course, with MAB we could still wait for statistical significance if we
want it, before turning off variations that are performing worse. And we can
certainly still try to draw conceptually useful conclusions by designing our
test variations in ways that facilitate easy comparison.

But with MAB and Myna it sounds like we can pretty well count on higher
conversion rates at the end of the day, as well, and that counts for a lot in
a business context.

I'm grateful to the VWO team for writing up their analysis and findings, and
being so frank about the relative advantages of A/B and MAB. Their summary
above tells me what I need to know.

------
plinkplonk
Without concrete details about the experiment setup and the dataset on which it
was run (and ideally the source code, so we can be sure there aren't any bugs),
this is essentially meaningless.

Anyone can run "simulations" to prove anything. Providing just a summary table
is of little use. I am not saying that the Wingify folks are trying to mislead
people - just that this article doesn't have sufficient rigor to justify its
conclusions.

OTOH, many CS papers, even published ones, don't provide source code or
datasets so people can replicate the results, so perhaps this is the 'new
normal' ;) .

~~~
paraschopra
Here's the code (quick-and-dirty): <http://pastie.org/4007859>

I had double-checked the code, but it is quite possible that I made an
oversight somewhere.

~~~
plinkplonk
I am not saying that your code has bugs (or that it hasn't!).

However, running experiments to prove that one 'process' is superior to another
in a real world situation often involves more than running a chi square test
on randomly generated data ;) (which is what I understood your code to be
doing on a _very_ brief glance at it - sorry if I got it wrong).

There is nothing wrong with _starting_ your exploration with something like
this, but it really isn't sufficient to make the claims in the blog post about
specific ways in which MAB is better/worse than A/B testing.

This _seems_ methodologically dubious. I am _not_ a stats expert, though I use
statistics in my work, and I could be wrong, but testing _convergence to_
statistical significance doesn't seem to mean anything
(mathematically/statistically)!

I could be wrong - there are people with Stats PhD's here and I'll leave it to
them to tell me if I am. But I've never heard of an experiment (in the formal
sense) which ran _till_ a significance checking algorithm crossed some
tripwire level.

Is the sampling random? (especially for an MAB) What are the underlying
distributions (and assumptions about them)?

In your article you (imo) either need to say something like "this test isn't
really valid and so the conclusions shouldn't be taken very seriously" (which
would be weird thing for a company with your product to blog!) or you should
do some rigorous experiment design and be able to defend your results.

Right now it _reads like_ some MBA type plugged in two black box formulae in
an Excel sheet and drew conclusions from the results. (please note: I am _not_
saying you did it that way -just saying that the article _reads like_ not
enough thought went into setting up a proper experiment)

(Statistical) experiment design is a lot of work and surprisingly hard to
carry off well, and involves much subtlety - and theory! There are whole books
written about it. Maybe time to buy a few? :)

~~~
paraschopra
Yes, I agree we should probably put the code in the post so people know exactly
what was done. Yes, it is a quick and dirty simulation, and not a rigorous
analysis. I hope someone more knowledgeable does shed light on whether the
arguments have any fundamental oversights. The post is mainly meant for our
customers who asked if MAB is "better" than A/B testing. It clearly isn't, as
both are meant for answering different types of questions.

------
andreasklinger
I personally do not understand the need to compare bandit with A/B.

The goal of A/B is to decide as quickly as possible for the best one.

The goal of bandit is to optimize given content.

I wouldn't use bandit to "decide" the button color but as a simple
recommendation system. This seems far more natural to me, as it adapts better
to changing optima.

Example: Let's say I run a larger "fun content" media webpage. I have "awesome
videos", "funny images" and "goofy articles". On the bottom and right side of
each content page I show follow-up content. I would use a bandit here to
optimize the mix of images, videos, and text I recommend.

To spice it up: I would create cohorts for typical user behaviour (e.g.
registered male users) and only feed interactions from the last two weeks
into my bandit calculations.

------
dremmettbrown
"A/B testing is meant for strict experiments where focus is on statistical
significance, whereas multi-armed bandit algorithms are meant for continuous
optimization where focus is on maintaining higher average conversion rate."

Who cares about conversion rates, right? I'd much rather have statistical
significance than more clicks on my buy button / signups for my site / etc.

~~~
Drbble
The point is that with less significance you are less confident about what your
conversion rate "would have been" under other courses of action. You can't say
"this or that test has higher conversion" because of some simulation that has
a totally different structure than your real website.

Statistics is about managing unknowns, not creating knowns, and everyone
ignores this!

------
btilly
(For previous discussion on this topic, see
<http://news.ycombinator.com/item?id=4040022>.)

Let's stop a bad meme before it gets going.

Multi-armed bandit is a general problem. There are many strategies for
tackling it. An epsilon-greedy strategy is one strategy. It is not the only
strategy, and it is not even a particularly good one. There are many others.
In fact A/B testing is itself a solution to the multi-armed bandit problem.

In fact the standard way to classify strategies is their asymptotic average
_regret_ - how much time you spend pulling the bad lever. An epsilon-greedy
strategy has linear regret - even after a version has won, you keep pulling
the losing lever a certain fraction of the time.

Let's consider an alternate approach. Run an A/B test, then forever after
exploit it. This is _also_ a linear regret algorithm. Given two levers and the
way you're going to run the test, there is a probability that you will come to
the wrong conclusion and forever after do the wrong thing. Most of the time
your regret is fixed, but sometimes it is not.

So they are equal? Not exactly. A/B testing converges quickly, and has a very
small amount of linear regret forever after. An epsilon-greedy approach
converges more slowly. Furthermore to lower the long-term regret you have to
drop epsilon, which slows convergence. By the time you lower it to a level
that is comparable to an A/B testing approach, convergence is so slow as to be
operationally infeasible.

Thus the conclusion of this article is that A/B testing is a better
operational strategy than epsilon-greedy. And it is. But there are known
better strategies still. The algorithms that Myna uses, for instance, have
logarithmic regret. In a multi-armed bandit situation, they will soundly beat
A/B testing just as A/B testing soundly beats epsilon-greedy.
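
For readers who haven't seen a logarithmic-regret strategy, UCB1 is the
textbook example (this is an illustration of the class of algorithm, not a
claim about what Myna actually runs):

```python
import math

def ucb1_choose(conversions, impressions):
    """UCB1: play the arm with the highest upper confidence bound.

    Arms that have been tried less often get a larger exploration bonus,
    and the bonus shrinks like sqrt(log(t) / n_i), which is what yields
    logarithmic total regret for stationary reward distributions.
    """
    for i, n in enumerate(impressions):
        if n == 0:
            return i                      # try every arm at least once
    t = sum(impressions)
    def upper_bound(i):
        mean = conversions[i] / impressions[i]
        return mean + math.sqrt(2 * math.log(t) / impressions[i])
    return max(range(len(impressions)), key=upper_bound)
```

Unlike epsilon-greedy, the exploration here decays on its own, so the losing
arms get pulled less and less often over time.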

At that point you're left with two questions. The first is whether multi-armed
bandit is a good model for website optimization. The second is whether there
are any operational advantages to one approach over the other.

The answer to the first, for traditional multi-armed bandit algorithms, is a
sound no. While relative conversion rates remain comparable, absolute
conversion rates vary for all sorts of reasons. (Time of day, day of week,
longer cycles, external website improvements, etc.) That was my big criticism
the last time this discussion popped up, and there was no good answer to it.

 _But_ behind the scenes Noel (the CEO of Myna) and I have been in a
discussion. I believe that I can produce a multi-armed bandit algorithm that
has logarithmic regret and will work in the face of ever varying conversion
rates, as long as there is a consistent relative difference between the
versions.

At that point you are left with the operational differences. If you want to
test at a granularity of "user", a website often has to put most of its active
population into the test, before collecting enough data to know which version
was good. With A/B testing you will eventually put everyone into the better
version. With a multi-armed bandit, half your current users are stuck in the
wrong version.

In effect you have a limited population of levers available to pull, and have
to pull a good chunk of them before you start getting feedback on the rewards.

You can avoid this problem by testing at the granularity of an impression.
Whether or not this is OK to do is a business question, and the answer will
not always be yes. If it is not, then a multi-armed bandit approach is
inappropriate to use.

~~~
sparsevector
> I believe that I can produce a multi-armed bandit algorithm that has
> logarithmic regret and will work in the face of ever varying conversion
> rates, as long as there is a consistent relative difference between the
> versions.

That sounds interesting. As you undoubtedly know, there is a lot of literature
on multi-armed bandit problems and also on the non-stochastic multi-armed
bandit problem. The latter has the advantage of not assuming anything about
the way in which the sequence of rewards is generated.

There is a line of work in the non-stochastic setting which sounds related
to what you are describing. It's called the problem of tracking the best
expert: <http://www.cse.ucsc.edu/~mark/papers/track-long.ps> The idea in this
problem is to do well as compared not to the best fixed action but rather to
the best piecewise-constant sequence of actions. There are many variations of
this problem, but the basic idea is that you can have low regret as compared
to a changing sequence of actions so long as that sequence doesn't change too
"quickly", for some definition of quickly.

~~~
btilly
I know there is a lot of literature, which I'm unfortunately less familiar
with than I'd like to be.

Here is the specific problem that I am interested in.

Suppose we have a multi-armed bandit where the probabilities of reward are
different on every pull. The distributions of rewards are not known. It is
known that there is one arm which, on every pull, is more likely to produce a
reward than any other arm, by at least some minimum percentage. Which arm this
is is unknown, as is the minimum percentage.

In other words payoff percentages change quickly, but the winner is always a
winner, and your job is to find it. I have an algorithm that succeeds with
only logarithmic regret. I have no idea whether anyone else has come up with
the same idea, and there is so much literature (much of which I presume is
locked up behind paywalls) that I cannot easily verify whether anyone else has
looked at this variant.

~~~
sparsevector
I think what you're describing is basically the non-stochastic setting. The
idea there is to assume absolutely nothing about the rewards (they can be
generated by quickly changing distributions or even an adversary) but still
try to do as well as the best single arm. The standard algorithm for this
problem is the multiplicative weights update method, also called the
exponentially weighted average method or the hedge algorithm. It's a really
simple method where you keep a probability distribution over the arms and then
reweight the probabilities at each time step using something proportional to
the exponential of the rewards of the arms. If you run it for T time steps on
a problem involving N arms, it gives you regret O(sqrt(T log N)). This is
actually the best possible bound without additional assumptions. This is a
good paper on it: <http://cns.bu.edu/~gsc/CN710/FreundSc95.pdf>
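
A minimal sketch of that update, in the full-information setting the paper
describes (the learning rate eta and the names are illustrative):

```python
import math
import random

def hedge_update(weights, rewards, eta=0.1):
    """Multiplicative weights / exponentially weighted average update:
    after seeing every arm's reward this round, multiply each weight by
    exp(eta * reward) and renormalize to get the new play distribution."""
    new = [w * math.exp(eta * r) for w, r in zip(weights, rewards)]
    total = sum(new)
    return [w / total for w in new]

def hedge_choose(weights):
    """Sample an arm according to the current distribution."""
    return random.choices(range(len(weights)), weights=weights)[0]
```

The bandit (partial-information) variants replace the true rewards with
importance-weighted estimates, since only the pulled arm's reward is observed.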

There are many, many different variations of the method and result. For
example, you can show improved results assuming different things about the
rewards; for instance, you can show O(log(T)) regret with certain additional
assumptions. This is a really good book on different non-stochastic online
learning problems:
[http://www.ii.uni.wroc.pl/~lukstafi/pmwiki/uploads/AGT/Predi...](http://www.ii.uni.wroc.pl/~lukstafi/pmwiki/uploads/AGT/Prediction_Learning_and_Games.pdf)

~~~
btilly
The article is interesting, but assumes you get to pull all of the arms at
once and see all of the results. That's different from the website
optimization problem.

The book looks like it would take a long time to work through, and they don't
get to the multi-armed bandit problem until chapter 6. It goes into my todo
list, but is likely to take a while to get off of it...

~~~
sparsevector
Yes, I realized after posting that this is probably a better paper to link to:
<http://cseweb.ucsd.edu/~yfreund/papers/bandits.pdf> The algorithms for the
partial information setting are sometimes surprisingly similar to the
algorithms where you see all the results. The algorithm in the paper linked
above is essentially the same algorithm but with a small exploration
probability. The regret bound gets worse by a factor of sqrt(N), however.

------
lukis100
From a statistics perspective, a few things are missing. Most importantly, a
discussion of statistical power and a discussion of why exactly it is that
they test until statistical difference is found. Every scientist knows that if
you test a big enough sample, you are more than likely going to find a
statistical difference, regardless of whether it actually exists. Hence, using
only that as your heuristic for what makes one algorithm better than another
is not very useful.

~~~
paraschopra
Agreed that statistical power wasn't used to calculate how long the test should
have been run. I know statistical significance can be found earlier if you are
"looking" for it; that is why I ran the simulations until it was found at
least 10 times. (I know it is not the most scientific way, but I used it as a
heuristic. I don't know how to use statistical power in the case of MAB.
Perhaps the concept of 'statistical power' is not valid at all for MAB. Need to
study more.)

------
jgrahamc
This blog post is describing epsilon-greedy exploration/exploitation as
applied to A/B testing. It's not really multi-armed at all, since the "multi"
would imply more than 2. The real beauty of exploration/exploitation is the
ability to go beyond two simultaneous tests. Although they do show a test with
three options, this would be far better if it emphasized the "multi" more.

------
SeanDav
I would like to have seen more tests vs. the MAB-90 algorithm, as this appeared
the most interesting one to me. Also, the article appears to imply that A/B
testing is useful in the situation where you want a cleaner statistically
significant result, which I believe is actually a false argument.

~~~
paraschopra
> A/B testing is useful in the situation where you want a cleaner
> statistically significant result which I believe is actually a false
> argument.

Why do you say so? Can you elaborate?

~~~
SeanDav
Both methods can supply statistically significant results; there is no reason
that A/B testing is fundamentally superior in getting a statistically
significant result, although arguably A/B testing is more straightforward.

~~~
paraschopra
True, both methods can provide statistical significance. My only point is that
in general A/B testing may require fewer visitors than MAB for the same amount
of confidence.

------
m104
Talk about being too smart to solve the problem... The article is comparing
apples to concrete blocks. When you're hungry, you want apples. When you're
building a retaining wall, you want concrete blocks. One can't substitute for
the other.

Statisticians, academics, math enthusiasts, and People Who Spend Too Much Time
Building Tables and Graphs want concrete blocks. They want to build a case or
a report or something that suggests a point using a mathematical model. They
deal with "statistical significance" because that's how the tools they
understand work. Thus A/B testing.

Web workers, businesses, advertisers, people looking for better conversion
rates, etc. want to eat. They want apples. Talking about statistical
significance, trial sizes, and frequency measures doesn't help if they don't
get more apples. The article showed that MAB gets more apples than A/B, and by
a wide margin.

Comparing the two isn't just missing the point, it's drawing an equivalency
between "statistical significance" (whatever the tools say it means) and
actual _real_ _world_ significance, a.k.a. better results. If A/B testing
fails to even simulate better results, the proper word for that is _worse_.

(And don't even get me started about trying to run a statistical experiment
until significant results are found... someday that practice will be labeled
as fraud.)

------
rfergie
My take on bandit algorithms vs A/B testing:

If you have a question where the answer can be used to inform other decisions
(e.g. Which is more important for our customers, price or support level?) then
an A/B test will give you a clear answer that you can apply elsewhere.

If you don't really care what the answer is and just want the best option
(e.g. should the hero shot be looking up or down) then a bandit is the way to
go.

------
arkitaip
Reading the conclusions it seems that each approach - a/b testing vs multi-
armed bandit - has its merits. Maybe Visual Website Optimizer should be method
agnostic and offer customers the choice of method?

~~~
paraschopra
Yes, we have been discussing this internally. Providing both options is great,
but that assumes customers have a complete understanding of when to use which
method.

------
oulipo
1. Could you gist your Python code?

2. You do not indicate whether you've run enough simulations to attain high
statistical significance on the meta-result (that each campaign's statistical
significance had been attained).

3. Correct me if I'm wrong, but I don't think A/B testing should be run "until
statistical significance has been attained", because that defeats the purpose
of a statistical test. You should rather define a number of iterations
beforehand, wait until they have been completed, and then check whether the
result is statistically significant.

4. The purpose of the epsilon-greedy algorithm is that it can be run for a very
long duration without ever declaring a statistically significant winner,
because it adapts so that the best-performing version becomes the most
commonly shown version of your site (i.e. it is made to be run almost
indefinitely, and will -- in effect -- automatically phase out the non-working
versions after a while).

------
msellout
A/B testing is linear regression against a single binary variable. There are
many better techniques. Multi-armed bandit is just one of them. A more direct
comparison would be simply using a continuous independent variable, or at
least one with more than two values. You can have more variables pretty
easily, too.
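
To see the equivalence concretely, here is a made-up example: with a single 0/1
group indicator, the least-squares intercept is group A's conversion rate and
the slope is the difference between the two groups' rates.

```python
import numpy as np

# 0 = variation A, 1 = variation B; y = converted or not (made-up data).
x = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
y = np.array([0, 1, 0, 0, 1, 1, 0, 1], dtype=float)

# Least-squares fit of y = b0 + b1 * x.
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

print(b0)  # 0.25 -- conversion rate of variation A
print(b1)  # 0.50 -- variation B's rate minus variation A's rate
```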

Of course you get statistical significance with fewer observations if you have
fewer variables to model. Although you'd get significance with even fewer
observations if you had a continuous independent variable. But more variables
produce a higher r-squared; that is, you can explain more of the variation in
your conversion rate.

The prevalence of A/B testing demonstrates a sad lack of statistics training.
It's a problem in many science fields as well, especially biology.

------
snorkel
Was this analysis based on testing only two variations, A vs. B, using A/B
testing then MAB?

If so, then that ignores one of the best advantages of MAB: you can compare
variations A, B, C, D, E, F, ... all simultaneously, whereas with A/B testing
you'd have to coordinate a tournament between all of those variations to
determine the best of the bunch ... so it seems it would take just as long
with A/B testing (and thus a larger sample population) to arrive at knowing
the best variation from many choices.

------
moconnor
I'm confused: surely the right way to compare would be to say the A/B tests
pick the best option 100% of the time after 'reaching statistical
significance'[1] - this is at least apples to apples.

[1] Like others here, I believe testing until you see significance is Doing It
Wrong.

------
TimPC
I found it very interesting that MAB-90 was outperforming RAND on both
convergence speed and average conversion rate in the first experiment. This
seems a very interesting result, albeit not the one you might be looking for
in the article. I find it rather disturbing that this dataset was removed in
later experiments. It actually seems to provide a very valuable insight,
namely: multi-armed bandit strategies might be better than A/B testing, but
the exploration factor should be far higher because of the need for rapid
convergence to significance.

------
scottschulthess
It depends on how much traffic you have. If you have enough for 10% to quickly
provide statistically significant amounts of data, then there is no need to
provide more traffic to that variation to determine whether it is worthwhile
or not - you can split 90% to the already well performing version.

------
kennethcwilbur
This argument is wrong on so many levels.

------
guelo
My imaginary ideal would be a genetic algorithm that mutates the HTML and CSS
in random ways. You'll probably have users with completely unusable pages but
that's the price you pay for finding the optimal solution as it converges over
time, probably a really, really long time.

------
loup-vaillant
I cannot help but think that under a sufficiently precise set of assumptions
(prior knowledge about the various one-armed-bandits, cost of losses, gain of
winnings, expectations of being still in business in the future, etc), there
is a provably optimal strategy.

Isn't there?

------
yzhou
From an EE's point of view, what the OP said is that there's too much noise
involved in the multi-armed bandit algorithm. If that's the case, a low-pass
loop filter is needed. It's kind of like locking a PLL to a reference clock,
right?

------
it
Maybe a better way to do the exploitation phase would be to sample from the
different choices with probabilities proportional to their expected payoff
instead of just choosing the one with maximum expected payoff.
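
A sketch of that idea (probability matching on smoothed observed rates; the
Laplace-style smoothing and the names are my own illustrative choices):

```python
import random

def proportional_choose(conversions, impressions):
    """Show each variation with probability proportional to its (smoothed)
    observed conversion rate, instead of always showing the current leader."""
    # Laplace smoothing keeps every weight positive and gives an untried
    # variation a prior rate of 0.5, so new variations still get sampled.
    rates = [(c + 1) / (n + 2) for c, n in zip(conversions, impressions)]
    return random.choices(range(len(rates)), weights=rates)[0]
```

A fully Bayesian cousin of this is Thompson sampling, which samples each
variation's rate from its posterior rather than using the point estimate.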

------
its_so_on
( _From reading the post quickly, I'm not left with the impression that the
author understands the research they are 'refuting'. (Which I also only read
quickly and don't claim to fully understand.) This is just my impression_.)

~~~
paraschopra
Of course, I don't claim to understand the intricacies of MAB _completely_.
This is based on my understanding of the algorithm from the last couple of
days, but I'd like to be corrected by someone more knowledgeable if there's
anything wrong with my understanding. In fact, that would be quite helpful for
our customers as well.

