The Multi-Armed Bandit Problem and Its Solutions (2018) (lilianweng.github.io)
92 points by headalgorithm on Feb 5, 2019 | hide | past | favorite | 24 comments


Practical take: Thompson sampling is great. Use it if your main focus is getting things done. I found that most literature about MABs concentrates on various (mostly not too useful) approaches to MABs instead of on getting stuff done. Which is of course fine if your intent is to study MABs.

Thompson sampling is easy to implement, predictable, stable, and you can leave your experiments running forever without supervision. It has very few drawbacks.

With a combination of two HyperLogLogs and a Bloom filter, you can also have a really nice scalable distributed implementation (HLLs and BFs combine in an idempotent manner, so updates are lock-free), albeit at the cost of low precision early on as the number of samples is low.

Don't use a library. Thompson sampling itself is not difficult (specifically, 6 lines, as I'm looking at my Clojure code), you will need to understand it very well anyway, and most libraries will include lots of superfluous garbage. Most of the work is not with Thompson sampling, but with processing your events, and that is something which is application-specific anyway and can't be abstracted well.
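To make "not difficult" concrete: a minimal Beta-Bernoulli Thompson sampling sketch in Python (not the Clojure code mentioned above; all names here are illustrative) looks something like this. Each arm keeps success/failure counts, you sample once from each arm's Beta posterior, and play the arm with the highest sample.

```python
import random

def pick_arm(counts):
    """counts: list of (successes, failures) per arm.
    Sample from each arm's Beta(s+1, f+1) posterior and play the argmax."""
    samples = [random.betavariate(s + 1, f + 1) for s, f in counts]
    return max(range(len(counts)), key=lambda i: samples[i])

def update(counts, arm, reward):
    """Record a binary reward (truthy = success) for the chosen arm."""
    s, f = counts[arm]
    counts[arm] = (s + 1, f) if reward else (s, f + 1)
```

The event processing around this (deduplication, attribution, batching) is where the real work goes, as noted above.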


I agree. I did some research out of curiosity in the past about multi-armed bandits. The optimal solution is quite involved, and not worth the effort of studying in my opinion.

Thompson sampling, on the other hand... I'm in love. Clean, simple, straightforward, and so close to the optimal solution that you will never notice the difference. If you don't understand how to implement it, the stuff that you need to learn (mostly about the beta distribution) is generally useful and well worth studying. Though, after several years now, I'm still looking for an excuse to apply it somewhere. I can't get this method out of my head.


I'm a bit confused; as far as I know, Thompson sampling is optimal (insofar as it minimizes regret in expectation).


> Thompson sampling itself is not difficult (specifically, 6 lines, as I'm looking at my Clojure code),

Going through "A tutorial on Thompson Sampling" and being eager to hone my skills in Clojure, I would be very interested in reading this code if there is a repository online or a gist you could share with us.

Best Regards.


One of my touchstones for improvement in science being a real thing is that "I will run a multi-armed bandit study with the following 'arms' for the bandit, and the following procedure for developing more 'arms': ..." becomes an acceptable study to run.

(There are some very degenerate cases that are allowed; medical trials can fail and succeed so spectacularly that they can skip to the end of the given phase, for instance, which is sort of a degenerate case of a multi-armed bandit approach. But it's a rare exception.)

We've suffered a lot from the phrase "the scientific method" because of the article "the". The name of the game is to increase and decrease confidence in hypotheses by any legitimate statistical method, not to follow "the" method. Science suffers immensely as a result of the extreme requirement to "call your shots" in advance, which contributes to all sorts of problems. The current system almost entirely forbids any sort of exploration; it is sort of snuck in the side door, rather than officially supported.

(It's very analogous to the problems with waterfall design in our industry; the absolute worst times to estimate how expensive a problem is, how likely the solution is to solve it, how the solution will integrate with the rest of the world, and all of the other interesting engineering issues is at the very beginning of the project, when you know the least. One of the virtues of an iterative process is your ability to react to what you learn; waterfall is such a problem because you still inevitably learn these things but the process forbids you from exploiting that knowledge.)


> Science suffers immensely as a result of the extreme requirement to "call your shots" in advance, which contributes to all sorts of problems.

Science is currently suffering a 50+% replication failure crisis as a result of not following "the" method. Studies forced to preregister have shown a dramatic reduction in false positives.


A non-trivial component of the replication failure is the extreme incentives to cheat, p-hacking and friends, because if you call your shot, and reality says "you're wrong", the real current system, the one that actually determines who gets promotions and more grant money, says that scientist fails.

Sure, the pretend system says this scientist has done noble and laudable work by showing some hypothesis was wrong, but who cares what the pretend system says? Certainly not anyone who wants promotions (or even just job security) and grant money.


Wikipedia captures the essence of the Scientific Method:

"...It involves formulating hypotheses, via induction, based on such observations; experimental and measurement-based testing of deductions drawn from the hypotheses; and refinement (or elimination) of the hypotheses based on the experimental findings. These are principles of the scientific method, as distinguished from a definitive series of steps applicable to all scientific enterprises."

https://en.wikipedia.org/wiki/Scientific_method

There are many ways to implement this method, but if you leave out a single element, you're doing something other than science.


As responses to criticisms of the current scientific method go, "but the current scientific method is the current scientific method!" isn't really that helpful. I know what is; I'm saying it should be something different.


You wrote: "The name of the game is to increase and decrease confidence in hypotheses by any legitimate statistical method, not to follow "the" method."

The definition I cited says nothing about the particulars - it just enumerates three essential elements.

What would you propose instead as general principles for a scientific method?


See my other nice long response from before this post.


I think you are missing the purpose of the hypothesis step. The hypothesis lets you design the experiment correctly. If you run an experiment without a hypothesis, you don't know what variables to control for.

Multi-armed bandit is more like math anyway.


"with the following 'arms' for the bandit, and the following procedure for developing more 'arms'"

There's nothing wrong with that as a hypothesis.


>There's nothing wrong with that as a hypothesis.

Well it's not Popperian, so it is controversial in that regard.

I'm actually completely fine with the first part of your hypothesis ("With the following 'arms' for the bandit"). Since an n-armed bandit experiment is just a shorthand for n-1 vs everything else experiments.

However I'm a bit confused/concerned by the procedure for developing more arms. What kind of situation are you imagining where that's useful? (Like I'd really want to hear a concrete example of this style of hypothesis).


I sometimes think about robot controllers this way. Every arm corresponds to an action, meaning a control feedback loop applied to the robot for a short time.

If you find that two actions are both fairly good in a given state, the system might try automatically generating a new action that's some synthesis of the two, either an average or some kind of genetic algorithm crossover.

I don't know if you can have a useful falsifiable hypothesis about individual arms. But you can about the entire system. The hypothesis you want to fail to falsify is that the system gradually improves its performance (measured by some reward function) over time.


"What kind of situation are you imagining where that's useful?"

I think we need to be able to fund some things that are a bit more exploratory. Imagine we're going to test how growing a particular culture of bacteria responds to a hundred chemicals. We can only test, say, 10 at a time for cost or equipment reasons. We know from previous science that the chemicals are related in some way, because organic chemicals have various relationships. Suppose we find as we m.a.b.-explore across our set of 100 possible chemicals that things with benzene rings seem to have whatever effect we're looking for after we've tested only 30. We started with a list of chemicals that have at most one such ring. The experimenter would like to try something with two rings.

The proposal is that there ought to be a lower-weight process for going to some committee or funding board or something and saying "According to these statistics, we believe we should try two benzene rings" and getting fast approval. You follow the multi-armed bandit process after that. Suppose for the sake of argument it actually fails miserably, so you stop trying that quickly in accordance with the process. You tried something really quickly, didn't need new funding, and didn't have to wait a year to run a new study. Maybe next week another grad student has another idea about what the real effect may be, and they run something with one more acetyl group than anything else, and it has 25 times the effect anything else has had up to this point, and in accordance with the multi-armed bandit approach, you begin exploring down that route.

By no means am I proposing that the experimenter just gets to do whatever change they want at whatever point they want. There is a lot of virtue in calling your procedures in advance.

But, we're programmers. We, of all people, should be aware that it is possible to define a "process" that does not consist of a straight set of instructions. Our processes are allowed to contain loops, conditionals, mathematics, etc. Science should be too. The process could be to go to a committee; the process could say "we reserve the right to have 5 free new thing trials in advance"; the process could be all sorts of things.

And we are not obligated to be stupid about the process. We do not need to blindly run m.a.b. trials on human medical trials or anything obviously stupid like that. And there would still be a place for old-fashioned, call-your-shot "We're going to do this exact thing and report what happens". It would just be a lot smaller than it is today.

In this example, after testing 40 out of 100 chemicals and perhaps testing 10 that weren't on the initial plan, we found a significant improvement. In the current system, the study plods on through testing all 100 chemicals, regardless of the information obtained partway through the study, and publishes a paper with those results and a proposal that, someday, maybe somebody ought to try the additional benzene ring, which then never gets funded because the initial study's results contained no significant results and nobody sees a reason to follow this line of inquiry.

What happens in reality is that if it's cheap, the professor might just try this on their own, but it's unsanctioned and takes away from what they are actually funded to do. We're depending on them to be dedicated enough to the cause to do this experiment despite the fact literally all the concrete incentives are aligned against them. And if it's expensive, it doesn't get done at all; no funding. Scientists need more leeway to explore, but with rigorous procedures for exploring.

I'd actually want to go back and carefully read Popper's original writing before I declared it's not "Popperian". It may not be; I don't know. But I would think it's distinctly possible that what we consider Popperian is actually a bastardization of his original points into this oversimplified view of science. This being an HN posting and not an academic debate, I don't have the time to properly dig into that right now.

Science has calcified around some procedures that are simply not mathematically justifiable. In a more scientific world, scientists would read that (and the justification behind it), and rush to fix their underlying foundational procedures before anything else. The fact that scientists as a whole seem rather blasé about the whole problem significantly degrades my respect for them.

(And as they seem nearly equally blasé as a whole about the reproducibility crisis, my respect isn't running all that high to begin with. "But jerf, there are many people talking about it and dealing with it." Yes, but that's not the correct response. The correct response is that fixing it is your number one priority. When you know you have a reproducibility problem, but allocate <1% of your resources to fixing that while pouring the remainder of resources into something you just by your own methodology established is a hole in the ground, you're not behaving sanely.)


If this reading is too dense and you want a quick primer on multi-armed-bandit:

Imagine four slot machines. Each time you play, you choose one with a weighted probability, and the weight is how many times it won in the past.

So on your first play, you randomly select a machine. Now imagine that machine wins. It now has a probability of 1 vs the other three with 0. So you play it again. Now it loses. It now has a .5 probability, and the other ones have 0. Your next play will be a 50% chance of the same machine, and 50% chance of one of the others. You keep going this way and eventually you will pick the one that wins the most often most of the time, but sometimes you'll pick one of the others, just to make sure your probabilities are still correct.

How is this useful to a programmer? It's great for A/B tests. For example you come up with four different home pages, and then define success as clicking on the signup link. At first you randomly show a homepage to each visitor, but as you progress, you'll get one homepage that converts better than the others. You'll show that one most often, but sometimes show the other ones to give them a chance to "catch up", just in case your initial traffic happened to be biased.


Sorry, but I think rather than clearing things up, your post causes more confusion. First of all, you're not giving a quick primer on the multi-armed bandit problem, but an attempt at describing one specific algorithm for it. That distinction is important, because there are several algorithms with different trade-offs, hence the linked post. Second, your description isn't very coherent: probabilities have to add up to 1, and the probability to pick a machine in step two can't simultaneously be both 0% and 50%.

FWIW, the article already contains a very accessible description of what the problem is about: "Imagine you are in a casino facing multiple slot machines and each is configured with an unknown probability of how likely you can get a reward at one play. The question is: What is the best strategy to achieve highest long-term rewards?"


I think you're illustrating ONE way to deal with multi-armed bandits, specifically an epsilon-greedy strategy
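For comparison, a minimal epsilon-greedy sketch in Python (names are illustrative, not from the article): with probability epsilon you explore a random arm, otherwise you exploit the arm with the best observed mean reward.

```python
import random

def epsilon_greedy(means, pulls, epsilon=0.1):
    """means: observed mean reward per arm; pulls: play count per arm.
    Explore uniformly with probability epsilon (or if nothing has been
    played yet); otherwise exploit the best-looking arm."""
    if random.random() < epsilon or all(p == 0 for p in pulls):
        return random.randrange(len(means))
    return max(range(len(means)), key=lambda i: means[i])
```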


Also, for those curious, the name is because the common explanation/metaphor is of a gambler at a row of slot machines which are sometimes referred to as "one-armed bandits".


For those who are interested in multi-armed bandits, you can have a look at this paper for the budget limited version: https://eprints.soton.ac.uk/270806/1/LTT_AAAI2010_Bandit.pdf

extension of that paper: https://www.aaai.org/ocs/index.php/AAAI/AAAI12/paper/viewFil...


Tangential, why haven't we all standardized our mathematical notation yet?

This author uses K as the unbounded count variable in the tuple, yet n is used in the literature more frequently for uncounted variables of any unbounded set, including tuples.


Differences in notation between different fields abound. It used to be (probably still is) the math department didn’t talk to the engineering department, so people would reinvent basically the same solution in different contexts. Now people maintain the notation for internal consistency, instead of switching to “universally standard notation”.

Basically, historical consistency trumps cross-disciplinary consistency.


I mean in RL we use n consistently. It's all over the Sutton and Barto book. I'm looking at it now.



