
The multi-armed bandit problem (2012) - navigaid
http://stevehanov.ca/blog/?id=132
======
60654
Multi-armed bandits are also well known in game AI. They got popular with the
introduction of Monte Carlo Tree Search, where MABs are used to select which
subtrees to search for the largest expected payoff, e.g.:

[https://www.aaai.org/ocs/index.php/AIIDE/AIIDE13/paper/view/...](https://www.aaai.org/ocs/index.php/AIIDE/AIIDE13/paper/view/7377/7589)
[https://courses.cs.washington.edu/courses/cse599i/18wi/resou...](https://courses.cs.washington.edu/courses/cse599i/18wi/resources/lecture19/lecture19.pdf)
etc.

For what it's worth, the MAB algorithm in the original post looks like Epsilon
Greedy. It's probably better to look into Upper Confidence Bound variants like
UCB1, which dynamically adjust how much to explore vs. exploit, e.g.:

[https://towardsdatascience.com/comparing-multi-armed-
bandit-...](https://towardsdatascience.com/comparing-multi-armed-bandit-
algorithms-on-marketing-use-cases-8de62a851831)
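
For comparison, here is a minimal sketch of both selection rules (illustrative
bookkeeping, not the code from the post): counts[i] is how many times arm i
has been shown, and values[i] is its empirical mean reward (e.g. CTR).

    import math
    import random

    def epsilon_greedy(values, epsilon=0.1):
        # With probability epsilon show a random arm (explore),
        # otherwise show the arm with the best empirical mean (exploit).
        if random.random() < epsilon:
            return random.randrange(len(values))
        return max(range(len(values)), key=lambda i: values[i])

    def ucb1(counts, values):
        # Show each arm once first; afterwards pick the arm maximizing
        # mean + exploration bonus. The bonus shrinks as an arm is
        # played more, so exploration tapers off on its own.
        for i, n in enumerate(counts):
            if n == 0:
                return i
        total = sum(counts)
        return max(range(len(values)),
                   key=lambda i: values[i]
                   + math.sqrt(2 * math.log(total) / counts[i]))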

~~~
praptak
UCB on game trees (MCTS) was the first breakthrough that created decently
playing Go programs, if I remember correctly.

~~~
cgearhart
You are correct. MCTS + UCB and other variants were state of the art leading
up to AlphaGo. And even then, MCTS was still used in AlphaGo.

The main change in AlphaGo was using deep networks: a value network for
position evaluation (alongside fast rollouts) and a policy network for move
selection (rather than the UCB rule alone). They later removed the rollouts
entirely (merging the policy and value networks into one), but even AlphaZero
uses MCTS.

------
err4nt
The purpose of an A/B test isn't to always show the best performing result,
it's to perform a _controlled scientific experiment_ with a control group,
from which you can learn things.

Also, I work in this field and I will just say that people _do_ behave
differently based on traffic source: users coming from Facebook behave alike
but differently from traffic coming from Reddit, who in turn act similarly to
each other. If you were running a self-optimizing system like this, it _would_
make sense to split it up by traffic source and handle each one separately.

~~~
elehack
Yes. Bandits will often converge more quickly to the optimal strategy, but it
is much more difficult to understand _why_ that strategy is optimal, and to
generalize from the bandit outcomes to predict future performance or the
performance of other strategies.

It isn't impossible - bandits are seeing adoption in medical trials to avoid
precisely the problem discussed - but the standard experiment design and
analysis techniques you learn in a decent college statistics class or
introductory statistics text no longer apply. That's one of the beauties of
A/B testing: while it does require substantial thought to do well, the basic
statistics of the setup are very well-understood at this point.

~~~
orasis
I disagree. I’ve spent a lot of time staring at bandit outcomes and usually
they match some sort of intuition of why a variant might be exceptional.

~~~
comicjk
That could be post-hoc reasoning, though. It would be interesting to pre-
register your hypotheses, or see whether you could tell bandit outcomes from
random ones.

~~~
orasis
Sure it’s post-hoc reasoning, but it doesn’t matter because I’m not trying to
invalidate a hypothesis.

I’m looking for variants that win. When I find one that wins I look at it and
try to add more of the same flavor to the product.

This process works.

~~~
taeric
This is literally the post-hoc fallacy. You could get lucky. Maybe you have
obvious gains to chase. But bad logical arguments are bad because they never
work forever. They are corrupted heuristics that can get you into trouble
without critical thinking.

Edit: added in forever. Phone dropped some wording I originally had. I think.

~~~
orasis
Call it a genetic algorithm if you like. I’m looking for incremental wins in a
world of infinite possibilities, not truth.

~~~
taeric
Incremental wins can still lead to dead ends. My phrasing was off in my post.
I meant to say that the problem with these fallacies isn't that the tactics
never work, just that they can stop working without you really realizing it.
A heuristic like that can lead you down a dead end.

By all means, keep doing it if it is working for you. But don't confuse it as
good advice. And stay vigilant.

~~~
orasis
Products exist in human reality not some science paper. There are no absolute
truths, everything dead-ends eventually. It’s like trying to prove that one
set of genes is better than another for future survival - an impossible task.

~~~
taeric
This betrays a belief that science doesn't reflect the real world. It
absolutely does.

Again, it may be working in your case. Argument from authority can go a long
way. Even ad hominem attacks often exist due to a "smell" of the person
speaking. It is not, however, logically sound, and it can easily lead to
unsupportable positions.

So, take care. And realize that a lot of the damage of poor practices may be
tangential: for example, a belief that the real world cannot be described by
science.

------
founderling
Multi-armed bandit algorithms have been used _forever_ to optimize ads,
websites, newsletters, etc.

Here is an article from 2013 that describes Google's multi-armed bandit
integration into AdSense:

[https://www.cmswire.com/cms/customer-experience/google-
integ...](https://www.cmswire.com/cms/customer-experience/google-integrates-
adsense-experiment-objectives-into-content-experience-platform-022422.php)

The first time I stumbled across the term "Multi-Armed Bandit" was when I read
Koza's "Genetic Programming: On the Programming of Computers by Means of
Natural Selection" in 1992. When I later got involved in e-commerce projects,
it was immediately clear to me that this was the way to tackle the
optimization tasks involved.

~~~
OJFord
OP itself says 'posted six years ago', so should probably have a (2013) - or
thereabouts - in the submission title.

------
ampersandy
We (Facebook) just released our Adaptive Experimentation platform as open
source: [https://github.com/facebook/Ax](https://github.com/facebook/Ax). It
supports bandit optimization for discrete sets of candidates and Bayesian
optimization for continuous/very large search spaces.

[https://ax.dev/docs/why-ax.html](https://ax.dev/docs/why-ax.html)

------
XCSme
Just a small nitpick: this doesn't take into account implementation cost. If
you want something dynamic like this, it means your app needs _read_ access to
all the analytics recorded (or at least the ones needed for optimization).
Most of the time, apps only _send_ data to the analytics services; developers
read/analyze the data and act on it. I personally haven't worked on any apps
that used analytics _read_ access within the app itself. So, in most cases
it's a lot more than 20 lines to implement an approach like this (hopefully
your analytics platform exposes a read API that you can integrate into your
app), compared to A/B testing, where you just show several versions, then
analyze the data and iterate.

~~~
ineedasername
Good point, the article says "do it in 20 lines of code", but implementation
at scale may require a little more engineering than that.

------
mhoad
Previous discussions worth checking out here
[https://news.ycombinator.com/item?id=11437114](https://news.ycombinator.com/item?id=11437114)

and here
[https://news.ycombinator.com/item?id=4040022](https://news.ycombinator.com/item?id=4040022)

~~~
ermik
Besides the comment trees in those threads, do you have any other links to
relevant writing on the subject?

------
_eigenfoo
Bandit algorithms can also be approached from a Bayesian point of view! This
lets you quantify the uncertainty in your estimates (e.g. how uncertain you
are that ad A has a higher CTR than ad B), which a lot of other bandit methods
don't offer.

To toot my own horn a bit, I wrote a blog post about Bayesian bandits:
[https://eigenfoo.xyz/bayesian-bandits/](https://eigenfoo.xyz/bayesian-
bandits/)
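
For anyone who wants the flavor without clicking through: a minimal sketch of
one common Bayesian bandit, Thompson sampling with Beta-Bernoulli posteriors
for click/no-click rewards (not necessarily the exact approach in the post;
the three-ad setup and variable names are just illustrative).

    import random

    # One Beta(alpha, beta) posterior per ad, starting from a uniform prior.
    successes = [1, 1, 1]  # 1 + clicks for each of three hypothetical ads
    failures = [1, 1, 1]   # 1 + non-clicks

    def choose_ad():
        # Draw a plausible CTR from each posterior and show the ad with
        # the highest draw; posterior uncertainty drives the exploration.
        samples = [random.betavariate(successes[i], failures[i])
                   for i in range(len(successes))]
        return max(range(len(samples)), key=lambda i: samples[i])

    def update(ad, clicked):
        # Fold the observed click/no-click back into that ad's posterior.
        if clicked:
            successes[ad] += 1
        else:
            failures[ad] += 1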

------
daenz
It seems close to simulated annealing in the travelling salesman problem. The
basic idea is that you start with a random tour and adjust path segments
incrementally. Most of the time you choose a new path segment that decreases
the global route cost, but occasionally, at random, you choose one that
increases it. This random factor is decreased over time, so the route anneals
and settles on a close-to-optimal result.

It shares the same principle of choosing reasonable options most of the time,
but allowing variation to keep from getting stuck in a local optimum.
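
For illustration, the accept/reject core of that scheme looks something like
this (a generic Metropolis-style sketch, not tied to any particular TSP
implementation):

    import math
    import random

    def accept(delta_cost, temperature):
        # Always accept an improving segment swap; accept a worsening one
        # with a probability that shrinks as the temperature cools. This
        # is what lets the tour escape local optima early in the run.
        if delta_cost <= 0:
            return True
        return random.random() < math.exp(-delta_cost / temperature)

    # Typical schedule: multiply temperature by ~0.999 per step, so late
    # in the run almost no cost-increasing swaps are accepted.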

~~~
opportune
Basically you are comparing one strategy to prevent local optima in one
optimization algorithm to another strategy in a different optimization
algorithm. There are such strategies for basically every optimization problem.

Other examples: momentum in deep learning, tabu search, random search, etc.

The benefit of doing a pure binary A/B test is that the experiment is so
simple that, as long as you don't break the cardinal rules (you need a fixed
experiment size! Most people don't even do this. You also need to avoid
biasing your control/experiment sets), it is easy to get statistically valid
results. When doing multivariate optimization on something that is measured
(such as user engagement) rather than directly evaluated, you need a good
amount of data for each configuration, or you run the risk of optimizing for
random noise. This is true even of the multi-armed bandit problem: if you vary
the exploitation strategies over time, then confounding temporal variables
(for example, purchases are higher on weekdays because that's when people make
business decisions at work) can invalidate the experiment if not controlled
for.

~~~
marcosdumay
The GP is comparing the algorithms. MAB is essentially simulated annealing
with an exponentially decaying temperature.

------
a13n
Also, one reason a lot of teams can't test more than two options (A, B, C, D,
E, F, G, etc.) is that you need a TON of traffic for the results to be
statistically significant.

~~~
comicjk
Statistical significance isn't necessary for deriving value from information!
The point of multi-armed bandit is that you use the best information you
currently have, while also not taking that information too seriously. In a
context where experiments have a cost and you need results, this makes more
sense than gathering more and more data until you meet statistical
significance thresholds.

~~~
opportune
If random noise is the more likely explanation for one hypothesis appearing
better than another, then your information has very little value.

~~~
wongarsu
Guessing based on uncertain data still seems better than guessing randomly?
Also, if the signal-to-noise ratio is low, my data is likely just noise, but
at the same time the decision doesn't matter much. The clearer the difference,
the more important the decision, and the more likely it is that my data is
right.

------
dr_dshiv
I published this work at CHI on the use of multi-armed bandits in educational
games. My biggest takeaway was the importance of choosing the right metric --
because our optimization worked too well.

[https://www.researchgate.net/publication/301935710_Interface...](https://www.researchgate.net/publication/301935710_Interface_Design_Optimization_as_a_Multi-
Armed_Bandit_Problem)

~~~
barbarr
This was an interesting read. I played the game, too! Interesting to see how
the optimizer converged on such a large ship size, even though you only
allowed that case to verify whether the optimizer would successfully reject
it.

------
lowdose
Microsoft gave a good talk about contextual bandits and machine learning. The
business case discussed was webpage conversion optimization, where conversions
increased by 80%.

[https://www.microsoft.com/en-us/research/blog/new-
perspectiv...](https://www.microsoft.com/en-us/research/blog/new-perspectives-
on-contextual-bandit/)

[https://podcasts.apple.com/nl/podcast/microsoft-research-
pod...](https://podcasts.apple.com/nl/podcast/microsoft-research-
podcast/id1318021537?l=en&i=1000437509373)

------
j2kun
UCB1 is really not that much more complicated than epsilon-greedy. Some
slightly sloppy code I wrote a few years ago, maybe 20 lines of code:
[https://github.com/j2kun/ucb1/blob/master/ucb1.py#L7-L35](https://github.com/j2kun/ucb1/blob/master/ucb1.py#L7-L35)

Sure you have to read a bit more to know why it works, but if you write your
code well you could plug this in without any extra trouble. It's not like you
need a special optimization solver as a dependency.

~~~
nestorD
And, for not much more effort (computing the variance of your samples), you
can use UCB1-tuned [0], which gets rid of the 'c' parameter and tends to be
even better.

I personally think that it should replace UCB1 as the baseline when trying
bandit algorithms.

[0]:
[https://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf](https://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf)
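
Concretely, the only change from UCB1 is the exploration term. A sketch of the
index from the paper (assuming rewards in [0, 1], and that you track the
running sum of squared rewards per arm in addition to the mean; the names here
are illustrative):

    import math

    def ucb1_tuned_index(mean, sq_sum, n_arm, n_total):
        # Per-arm variance estimate plus its own exploration term,
        # capped at 1/4, the maximum variance of a [0, 1] reward.
        v = sq_sum / n_arm - mean ** 2 + math.sqrt(2 * math.log(n_total) / n_arm)
        # Same shape as UCB1, but the tunable constant is replaced by
        # the measured variance of this arm's rewards.
        return mean + math.sqrt((math.log(n_total) / n_arm) * min(0.25, v))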

~~~
j2kun
It's funny, I had read that paper a few times while learning about bandit
learning, and I never noticed their version, which funnily enough outperforms
vanilla UCB1 in all of their tests!

------
sbov
One thing that is vastly more expensive than picking a winner in your A/B test
is maintaining N variations across M features.

------
seoulsister
Great to see Multi-Armed Bandits getting coverage here. I wrote an article
last year to help readers gain a deeper understanding of the topic:

[https://towardsdatascience.com/bandits-for-recommender-
syste...](https://towardsdatascience.com/bandits-for-recommender-system-
optimization-1d702662346e)

------
astazangasta
>In recent years, hundreds of the brightest minds of modern civilization have
been hard at work not curing cancer.

As a cancer researcher I read this as a just criticism of my work.

------
kurthr
Ahhh, the epsilon greedy strategy... I forgot I had read this until I was a
few paragraphs in. If you have rare positive results, it gives you better
statistics on your highest signal tests, while still evaluating the others
(i.e. more than one alternative).

------
NewsAware
At least for us, most A/B/.. tests require a stable assignment of test to
user, as key metrics like retention would obviously be skewed by randomly
re-assigning on each visit/page view.

~~~
jfries
The user id can be used to seed the random number algorithm to achieve that.
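
One common way to get that effect is a hashing variant of the same idea (a
generic pattern, not from the thread): hash the user id together with an
experiment name, so each user lands in a stable bucket without storing any
state.

    import hashlib

    def assign_variant(user_id, experiment, variants=("A", "B")):
        # The same user + experiment always hashes to the same bucket,
        # so the assignment is stable across visits without stored state.
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        return variants[int(digest, 16) % len(variants)]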

~~~
NewsAware
Hmmm, but if you change the distribution, this won't be deterministic anymore,
unless the randomness only applies to new users entering the test setup.

~~~
leshokunin
Which is necessary. You don’t want to compare two wildly different users. This
is why you need to define criteria for eligibility.

------
everyone
"hundreds of the brightest minds of modern civilization have been hard at work
not curing cancer. Instead, they have been refining techniques for getting you
and me to click on banner ads."

Just out of curiosity... Have you _ever_ purposefully clicked on an ad on the
internet? I honestly don't think I ever have.

ps. I mean an outright, overt, straight-up ad, not, for example, some article
linked on HN that is a thinly veiled promo piece for something.

~~~
orasis
I clicked on an ad for women’s yoga pants once. The retargeting makes the web
a constant stream of delights.

~~~
WrtCdEvrydy
Interesting...

------
moultano
Does anyone have a reference for solving multi-armed bandit problems with a
finite time horizon? I would like something that derives rules or heuristics
for how your explore/exploit tradeoff changes as the horizon approaches.

This seems like an obvious extension, and something that someone should have
worked on given how long this problem has been around, but I've been unable to
find anything on it. Any pointers?

~~~
bhl
What do you mean? Most analyses of multi-armed bandit algorithms assume a
finite time horizon. And if not, they use the doubling trick for infinite time
horizons.

~~~
moultano
Thank you, now I realized that I had misunderstood the notation.

------
gchamonlive
For me the website is down (it's 404'ing, accessing from Brazil), but it can
still be accessed via the Wayback Machine:
[https://web.archive.org/web/20190527144648/https://stevehano...](https://web.archive.org/web/20190527144648/https://stevehanov.ca/blog/?id=132)

------
cfarm
MAB seems to find a local maximum subject to input biases, whereas an A/B test
aims to figure out a scientific truth and isolate all potential biases in the
system. I'd be curious to hear of cases where a MAB approach and an A/B test
did not yield the same results, and why that happened.

~~~
orasis
A/B testing doesn’t eliminate time-based biases such as the novelty of new
features, special events, pay day, tax day, etc.

~~~
cfarm
Sure - my point is that you could design the experiment such that it could.
It's much more scientific.

------
jedberg
Before I even clicked I knew this was going to be about multi-armed bandits. I
agree with the author -- why don't more people know about this? It's so
effective and easy, and you get results much quicker.

------
Aeolun
How is a bandit related to this?

Based on just the article, it could be a multi-armed chef.

~~~
masonicb00m
A slot machine is a “one armed bandit”.

------
Causality1
I encounter a/b testing primarily in article headlines, where the choice is
always "accurate, informative title" or "inflammatory sensational bullshit
clickbait" and the clickbait wins every time.

------
messe
Despite being just text, images, and tables, the post doesn't appear at all
with JS disabled.

