Multi-armed bandits are also well known in game AI. They got popular with the introduction of Monte Carlo Tree Search, where MABs are used to select which subtree to search for the largest expected payoff, e.g.:
https://www.aaai.org/ocs/index.php/AIIDE/AIIDE13/paper/view/... https://courses.cs.washington.edu/courses/cse599i/18wi/resou... etc.
For what it's worth, the MAB algo in the original post looks like epsilon-greedy. It's probably better to look into Upper Confidence Bound variants like UCB1, which dynamically adjust how much to explore vs. exploit, e.g.:
https://towardsdatascience.com/comparing-multi-armed-bandit-...
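To make the difference concrete, here's a minimal UCB1 sketch (after Auer et al. 2002) -- an illustrative example, not the code from the original post. Each arm's score is its mean reward plus an exploration bonus that shrinks the more that arm has been tried, so exploration tapers off on its own instead of staying at a fixed epsilon:

    import math

    # Arms are page variants; reward is 1 for a conversion, 0 otherwise.
    class UCB1:
        def __init__(self, n_arms):
            self.counts = [0] * n_arms    # times each arm has been shown
            self.values = [0.0] * n_arms  # running mean reward per arm

        def select_arm(self):
            # Show every arm once before applying the UCB formula.
            for arm, count in enumerate(self.counts):
                if count == 0:
                    return arm
            total = sum(self.counts)
            # Mean reward plus an exploration bonus that shrinks with more plays.
            scores = [self.values[a] + math.sqrt(2 * math.log(total) / self.counts[a])
                      for a in range(len(self.counts))]
            return scores.index(max(scores))

        def update(self, arm, reward):
            self.counts[arm] += 1
            self.values[arm] += (reward - self.values[arm]) / self.counts[arm]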
You are correct. MCTS + UCB and other variants were state of the art leading up to AlphaGo. And even then, MCTS was also used in AlphaGo.
The main change in AlphaGo was using deep networks: a value network for evaluating positions (alongside fast rollouts) and a policy network to guide move selection, rather than relying on the UCB rule alone. Later versions merged the policy and value networks and dropped the rollouts entirely, but even AlphaZero still uses MCTS.
The purpose of an A/B test isn't to always show the best performing result, it's to perform a _controlled scientific experiment_ with a control group, from which you can learn things.
Also, I work in this field and I will just say that people _do_ behave differently based on traffic source: users coming from Facebook behave like one another, but differently from traffic coming from Reddit, who in turn act similarly to each other. If you were running a self-optimizing thing like this it _would_ make sense to split it up by traffic source and handle each one separately.
What you said is correct, but I'd like to point out that controlled scientific experiments may not be the right approach for e.g. optimizing conversions on a website.
The reason is that websites are a dynamic environment. All things being equal, better-controlled experiments are great. However, visitor behavior changes all the time, especially when traffic comes from dynamic sources (Google SERPs change weekly).
And that's why I prefer MAB over A/B tests -- A/B tests don't adapt to a dynamic system, so we often wish away the changes in the system in order to use them. Does anyone go back and re-test their biggest wins?
> Does anyone go back and re-test their biggest wins?
Yes, absolutely! We do research first, then come up with simple, well-controlled tests. Once we have a winner we can lock it in, but often we continue to research and experiment on the new knowledge we gained. A hefty minority of the tests I implement build on past wins to further flesh out what works and what doesn't, with knowledge and data to back it up.
How do you do knowledge transfer when you're documenting what you've learned? I have found that even simple changes often don't reproduce on the same site. It feels like an easy trap for an expert practitioner to fall into -- the belief that they can predict the outcome of a test based on a different test they ran.
Sometimes, yes, but often it's not for the reasons people think. This is very much Daniel Kahneman territory -- we can't trust what our brain is telling us, namely that this experience will have similar results in some other context.
I recently worked at a place that ran > 20 websites, huge traffic numbers, with around 100 simultaneous tests. We would poll each employee about their guess for the test winner. The results were about on par with everyone randomly guessing.
Very interesting. It seems to me that doing incremental work like this might end up at a local minimum/maximum. Do you have any advice on how to avoid pitfalls like that? Are you testing radically different ideas alongside your incremental improvements?
From a workflow perspective, it's difficult to find radical improvements with MAB. The radical improvements come from fundamental design changes, and it would be very expensive to create a bunch of radically different variants.
MAB is best used where you can generate a bunch of variants cheaply and hope for a 30% gain.
Yes. Bandits will often converge more quickly to the optimal strategy, but it is much more difficult to understand why that strategy is optimal and generalize from the bandit outcomes to predict future performance and performance of other strategies.
It isn't impossible - bandits are seeing adoption in medical trials to avoid precisely the problem discussed - but the standard experiment design and analysis techniques you learn in a decent college statistics class or introductory statistics text no longer apply. That's one of the beauties of A/B testing: while it does require substantial thought to do well, the basic statistics of the setup are very well-understood at this point.
But for results to generalize, or to understand why, the confounders must be accounted for in the randomization. This is really hard to do well -- there are often subtle influences whose impact on these non-linear systems isn't sufficiently understood. What makes someone convert? A million different factors; changing the color of a button in one context doesn't necessarily tell me much about how people would respond to that experience in another context.
It's easy to underestimate how complex things are, because we only see some superficial aspects of e.g. a user/software interaction model. This flaw is down to how our brains work -- ref "What you see is all there is".
That could be post-hoc reasoning, though. It would be interesting to pre-register your hypotheses, or see whether you could tell bandit outcomes from random ones.
This is literally the logical fallacy at work. You could get lucky. Maybe you have obvious gains to chase. But bad logical arguments are bad because they never work forever. They are corrupted heuristics that can get you in trouble without critical thinking.
Edit: added in forever. Phone dropped some wording I originally had. I think.
Incremental wins can still lead to dead ends. My phrasing was off in my post. I meant to say that the fallacy isn't that the tactics never work, just that they can stop working without you really realizing it. A heuristic that can lead you down a dead end.
By all means, keep doing it if it is working for you. But don't confuse it as good advice. And stay vigilant.
Products exist in human reality, not in some science paper. There are no absolute truths; everything dead-ends eventually. It's like trying to prove that one set of genes is better than another for future survival -- an impossible task.
This betrays a belief that science doesn't reflect the real world. It absolutely does.
Again, it may be working in your case. Argument from authority can go a long way. Even ad hominem attacks often exist because of a "smell" about the person speaking. It is not, however, logically sound, and it can easily lead to unsupportable positions.
So, take care. And realize that a lot of the damage of poor practices may be tangential -- for example, a belief that the real world cannot be described by science.
Isn't this problem also an issue when people talk about transferring what they learn from one test to another test? That is frequently cited as a benefit of A/B testing.
> The purpose of an A/B test isn't to always show the best performing result
The key part is 'not always'.
Using A/B tests just to find the better performing version is a valid use case too. And if you have the traffic, multi-armed bandits are a nice way of automating the whole procedure. Even scientifically speaking there is nothing wrong with them. Their biggest issue is that they require a lot of traffic for significant results.
In practice, MAB should require less traffic than an A/B test for the same level of significance. Although it's much more difficult to rigorously describe that significance level with a MAB.
Wouldn't a simple solution be to create a handler that runs a MAB for each traffic source?
If you're worried about the site changing for people between visits from different traffic sources, you can cookie their MAB/test-assignment on the first visit.
If your ultimate goal is just to convert, convert, convert, and your content strategy can generate enough different content to sell to different types of people (usually that content strategy is what's missing), you can set up tracking for everything you can imagine: did the user scroll down and then scroll all the way back up to the top? How long was the page loaded before the user scrolled the first time? All kinds of stuff is trackable. You can also split the traffic by origin (because traffic sources seem to behave alike). I know a guy who does lead qualification, and with his setup he can predict with 90% accuracy whether a user will convert, after 1000 visitors from a given traffic source. He's tracking about 30 variables to come to that prediction.
These things are easy to track, but if your content strategy has no alternate content to act on, having that ability is useless.
Some things that you could do to alter a sales funnel to try to convert a customer:
- if you think they need more priming before they're ready to buy, add more pages into the funnel for that user
- if you think they're in a place to buy immediately, make sure you show them a Call to Action right then so they can buy
- swap entire blocks of text based on what kind of user we think they are
- change the length of pages (adding or rearranging elements on the page to highlight different things to different users, based on what we think they care about more)
- etc
You can also often just ask users to self-identify a 'role' that is useful for content strategy, like "Are you a [student] or [teacher]" and people will click the button, and even expect the page to change and customize to that experience. All this stuff is so easy to do technologically, but you have to have a content strategy that actually makes use of the insights available.
> Also, I work in this field and I will just say that people _do_ behave differently based on traffic source: users coming from Facebook behave like one another, but differently from traffic coming from Reddit, who in turn act similarly to each other.
Out of curiosity, what is your hypothesis for explaining this difference in behavior? Would you say it's primarily due to differing contexts in which a link is posted, or differing populations on each platform? Or maybe these both contribute about equally?
Stated another way: would you expect the same individual to behave differently coming from Facebook versus coming from Reddit, if they happen to be a user of both?
Not a statistician by any means, but could traffic source be a factor that's evaluated alongside conversion by a bandit algorithm when calculating the chance to show a particular option? Or other factors as well (detected device capabilities, the user's location, etc.)?
These could just be weighting factors, so instead of a single % chance per option, every time there's a successful interaction the victory is spread across the factors for that option.
New users could then be shown the option whose chance is weighted by the factors they match. Is this a valid approach, or would it introduce some kind of selection bias?
Yes, of course. You can use a contextual MAB and include the traffic source in the context. Contextual MABs are much more complicated and expensive to implement though.
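The very simplest version -- and roughly what the grandparent suggested -- is just keeping an independent bandit per traffic source; proper contextual bandits (e.g. LinUCB) go further and share information across contexts. A rough sketch, assuming a UCB1-like class with select_arm/update methods (like the one sketched earlier in the thread); the names here are illustrative, not from any particular library:

    from collections import defaultdict

    # One independent bandit per traffic source -- the crudest form of "context".
    # `bandit_factory` could be UCB1, Thompson sampling, etc.
    class PerSourceBandit:
        def __init__(self, n_variants, bandit_factory):
            self.bandits = defaultdict(lambda: bandit_factory(n_variants))

        def select(self, traffic_source):
            return self.bandits[traffic_source].select_arm()

        def update(self, traffic_source, variant, converted):
            self.bandits[traffic_source].update(variant, 1.0 if converted else 0.0)

    # Usage (hypothetical):
    #   router = PerSourceBandit(n_variants=3, bandit_factory=UCB1)
    #   variant = router.select("facebook")
    #   ...later, once the visit resolves...
    #   router.update("facebook", variant, converted=True)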
> The purpose of an A/B test isn't to always show the best performing result, it's to perform a _controlled scientific experiment_ with a control group, from which you can learn things.
Occasionally there is pure scientific interest.
But far more frequently, the purpose of the A/B is to optimize the outcome.
This is why Google Analytics has chosen multi-armed bandits exclusively for its A/B test framework.
If you want to use the results from one test to inform what to test next, then A/B tests are better at optimizing for truth.
Most website changes don't make a significant difference in conversion. If you use MAB, then you don't know if the winner is really better or just the result of random variation.
> If you use MAB, then you dont know if the winner is really better or the result of random variation.
No, you know the outcome of MAB is the best.
If it's clearly the best, it converges quickly. If it's less clearly the best, it converges slowly. Either way, you haven't lost any more conversions than necessary to find that out, or needed to put in non-mathematical failsafes.
The first time I stumbled across the term "multi-armed bandit" was when I read Koza's "Genetic Programming: On the Programming of Computers by Means of Natural Selection" in 1992. When I later got involved in e-commerce projects, it was immediately clear to me that this was the way to tackle the optimization tasks involved.
We (Facebook) just released our Adaptive Experimentation platform as open source: https://github.com/facebook/Ax. It supports bandit optimization for discrete sets of candidates and Bayesian optimization for continuous/very large search spaces.
Just a small nitpick: this doesn't take implementation cost into account. If you want something dynamic like this, your app needs read access to all of the recorded analytics (or at least the ones needed for optimization). Most of the time apps only send data to the analytics services; developers read and analyze the data and act on it separately. I personally haven't worked on any apps that used analytics read access within the app itself. So in most cases it's a lot more than 20 lines to implement an approach like this (hopefully your analytics platform exposes a read API that you can then integrate into your app), compared to A/B testing, where you just show several versions, analyze the data, and iterate.
Bandit algorithms can also be approached from a Bayesian point of view! This lets you quantify the uncertainty in your estimates (e.g. how uncertain you are that ad A has a higher CTR than ad B), which a lot of other bandit methods don't offer.
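For the curious, the classic example is Thompson sampling with a Beta prior over each ad's CTR. A minimal sketch, not tied to any particular library:

    import random

    # Thompson sampling with a Beta(1, 1) prior over each arm's click-through rate.
    # Reward is 1 for a click, 0 otherwise. Illustrative only.
    class BetaBandit:
        def __init__(self, n_arms):
            self.alpha = [1] * n_arms  # successes + 1
            self.beta = [1] * n_arms   # failures + 1

        def select_arm(self):
            # Sample a plausible CTR from each arm's posterior and play the best one.
            samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
            return samples.index(max(samples))

        def update(self, arm, clicked):
            if clicked:
                self.alpha[arm] += 1
            else:
                self.beta[arm] += 1

        def prob_first_beats_second(self, a, b, n_samples=10000):
            # Monte Carlo estimate of P(CTR_a > CTR_b) -- the kind of
            # uncertainty statement mentioned above.
            wins = sum(random.betavariate(self.alpha[a], self.beta[a]) >
                       random.betavariate(self.alpha[b], self.beta[b])
                       for _ in range(n_samples))
            return wins / n_samples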
It seems close to simulated annealing for the travelling salesman problem. The basic idea is that you start with a random tour and adjust path segments incrementally. Most of the time you choose a new path segment that decreases the global route cost, but occasionally you randomly accept one that increases it. This random factor is decreased over time, so the route anneals and settles on a close-to-optimal result.
It shares the same principle of choosing reasonable options most of the time, but allowing variation to keep from getting stuck in a local optimum.
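For reference, a bare-bones sketch of that annealing loop (2-opt moves on a tour; `dist` is assumed to return the distance between two cities -- illustrative only):

    import math
    import random

    def tour_length(tour, dist):
        return sum(dist(tour[i], tour[(i + 1) % len(tour)]) for i in range(len(tour)))

    def anneal(cities, dist, temp=10.0, cooling=0.999, steps=100000):
        tour = cities[:]
        random.shuffle(tour)  # start from a random tour
        cost = tour_length(tour, dist)
        for _ in range(steps):
            i, j = sorted(random.sample(range(len(tour)), 2))
            candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]  # reverse a segment (2-opt)
            delta = tour_length(candidate, dist) - cost
            # Always accept improvements; accept worse tours with probability
            # exp(-delta / temp), which shrinks as the temperature cools.
            if delta < 0 or random.random() < math.exp(-delta / temp):
                tour, cost = candidate, cost + delta
            temp *= cooling
        return tour, cost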
Basically you are comparing one strategy to prevent local optima in one optimization algorithm to another strategy in a different optimization algorithm. There are such strategies for basically every optimization problem.
Other examples: momentum in deep learning, tabu search, random search, etc.
The benefit of doing a pure binary A/B test is that the experiment is so simple that, as long as you don't break the cardinal rules (you need a fixed experiment size! Most people don't even do this. You also need to avoid biasing your control/experiment sets), it is easy to get statistically valid results. When doing multivariate optimization on something that is measured (such as user engagement) rather than evaluated, you need a good amount of data for each configuration, or you run the risk of optimizing for random noise. This is true even of the multi-armed bandit approach: if you vary the exploitation strategy over time, then confounding temporal variables (for example, purchases are higher on weekdays because that's when people make business decisions at work) can invalidate the experiment if not controlled for.
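To make "fixed experiment size" concrete: decide the sample size up front with a standard power calculation for two proportions, then stop when you hit it. A rough sketch using the textbook approximation (not from the parent comment):

    import math
    from statistics import NormalDist

    # Per-group sample size for detecting a lift from baseline rate p1 to p2,
    # at significance level alpha and the given power.
    def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
        z_beta = NormalDist().inv_cdf(power)
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

    # e.g. detecting a lift from a 5% to a 6% conversion rate:
    #   sample_size_per_group(0.05, 0.06)  -> roughly 8,200 visitors per variant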
Also, one reason a lot of teams can't do more than two options (A/B/C/D/E/F/G testing, etc.) is that you need a TON of traffic for the results to be statistically significant.
Statistical significance isn't necessary for deriving value from information! The point of multi-armed bandit is that you use the best information you currently have, while also not taking that information too seriously. In a context where experiments have a cost and you need results, this makes more sense than gathering more and more data until you meet statistical significance thresholds.
Guessing based on uncertain data seems better than guessing randomly? Also, if the signal-to-noise ratio is low, my data is likely just noise, but at the same time the decision doesn't matter much. The clearer the difference, the more important the decision and the more likely my data is right.
Go take a look at some solutions to the multi-armed bandit problem. A common trait is assigning a value (e.g. monetary) to each hypothesis. A weaker positive-outcome hypothesis is worth less. The method takes that into account.
I published work at CHI on the use of multi-armed bandits in educational games. My biggest takeaway was the importance of choosing the right metric -- because our optimization worked too well.
This was an interesting read. I played the game, too! Interesting to see how the optimizer converged on such a large ship size, even though you only allowed that case to verify whether the optimizer would successfully reject it.
Microsoft had a good talk about contextual bandits and machine learning. The business case discussed was webpage conversion optimization, which increased conversions by 80%.
Sure you have to read a bit more to know why it works, but if you write your code well you could plug this in without any extra trouble. It's not like you need a special optimization solver as a dependency.
And, for not much more effort (computing the variance of your samples), you can use UCB1-tuned [0], which gets rid of the 'c' parameter and tends to be even better.
I personally think it should replace UCB1 as the baseline when trying bandit algorithms.
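For reference, here's the UCB1-tuned index as I understand it from the Auer et al. paper: the fixed exploration constant is replaced by an empirical variance estimate, capped at 1/4 (the maximum variance of a Bernoulli reward). Sketch only:

    import math

    class UCB1Tuned:
        def __init__(self, n_arms):
            self.counts = [0] * n_arms
            self.sums = [0.0] * n_arms     # sum of rewards per arm
            self.sq_sums = [0.0] * n_arms  # sum of squared rewards per arm

        def select_arm(self):
            # Play each arm once first.
            for arm, count in enumerate(self.counts):
                if count == 0:
                    return arm
            t = sum(self.counts)
            scores = []
            for a, n in enumerate(self.counts):
                mean = self.sums[a] / n
                # Empirical variance plus a confidence term, capped at 1/4.
                v = self.sq_sums[a] / n - mean ** 2 + math.sqrt(2 * math.log(t) / n)
                scores.append(mean + math.sqrt((math.log(t) / n) * min(0.25, v)))
            return scores.index(max(scores))

        def update(self, arm, reward):
            self.counts[arm] += 1
            self.sums[arm] += reward
            self.sq_sums[arm] += reward ** 2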
It's funny, I had read that paper a few times while learning about bandit learning, and I never noticed their version, which funnily enough outperforms vanilla UCB1 in all of their tests!
Ahhh, the epsilon-greedy strategy... I forgot I had read this until I was a few paragraphs in. If you have rare positive results, it gives you better statistics on your highest-signal tests, while still evaluating the others (i.e. when there's more than one alternative).
At least for us, most A/B/... tests require a stable assignment of variant to user, as key metrics like retention would obviously be skewed by randomly re-assigning on each visit/page view.
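One common way to get that stable assignment is to hash a stable user id (or a cookie set on first visit) into a bucket, so the same user always lands in the same variant. A tiny sketch -- the names are placeholders, not anyone's production code:

    import hashlib

    def assign_variant(user_id: str, experiment: str, n_variants: int) -> int:
        # Deterministic: the same user always gets the same bucket for a given experiment.
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % n_variants

With a bandit the allocation probabilities shift over time, so there you'd instead draw a variant on the first visit and freeze it in a cookie, as suggested elsewhere in the thread.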
"hundreds of the brightest minds of modern civilization have been hard at work not curing cancer. Instead, they have been refining techniques for getting you and me to click on banner ads."
Just out of curiosity... have you ever purposefully clicked on an ad on the internet? I honestly don't think I ever have.
ps. I mean an outright overt straight up ad, not, for example, some article linked on HN that is a thinly veiled promo piece for something.
Does anyone have a reference for solving multi-armed bandit problems with a finite time horizon? I would like something that derives rules or heuristics for how your explore/exploit tradeoff changes as the horizon approaches.
This seems like an obvious extension, and something that someone should have worked on given how long this problem has been around, but I've been unable to find anything on it. Any pointers?
What do you mean? Most analyses of multi-armed bandit algorithms assume a finite time horizon. And if not, they use the doubling trick to extend to infinite time horizons.
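For anyone unfamiliar, the doubling trick just runs a fixed-horizon algorithm in phases of length 1, 2, 4, 8, ..., restarting it with the phase length as its horizon. A rough sketch, with `make_bandit` and `pull_arm` as placeholder hooks:

    def run_with_doubling(make_bandit, pull_arm, total_rounds):
        t = 0
        horizon = 1
        while t < total_rounds:
            bandit = make_bandit(horizon)  # fresh instance tuned for this phase's horizon
            for _ in range(min(horizon, total_rounds - t)):
                arm = bandit.select_arm()
                reward = pull_arm(arm)     # interact with the environment
                bandit.update(arm, reward)
                t += 1
            horizon *= 2                   # next phase is twice as long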
MAB seems to find a local maximum subject to input biases, whereas an A/B test aims to figure out a scientific truth and isolate all potential biases in the system. I'd be curious to hear about cases where a MAB approach and an A/B test did not yield the same results, and why that happened.
Before I even clicked I knew this was going to be about multi-armed bandits. I agree with the author -- why don't more people know about this? It's so effective and easy, and you get results much quicker.
I encounter a/b testing primarily in article headlines, where the choice is always "accurate, informative title" or "inflammatory sensational bullshit clickbait" and the clickbait wins every time.