
The Multi-Armed Bandit Problem and Its Solutions (2018) - headalgorithm
https://lilianweng.github.io/lil-log/2018/01/23/the-multi-armed-bandit-problem-and-its-solutions.html
======
jwr
Practical take: Thompson sampling is great. Use it if your main focus is
getting things done. I found that most of the MAB literature concentrates on
various (mostly not too useful) approaches instead of on getting stuff done.
Which is of course fine if your intent is to study MABs.

Thompson sampling is easy to implement, predictable, stable, and you can leave
your experiments running forever without supervision. It has very few
drawbacks.

With a combination of two HyperLogLogs and a Bloom filter, you can also have a
really nice scalable distributed implementation (HLLs and BFs combine in an
idempotent manner, so updates are lock-free), albeit at the cost of low
precision early on as the number of samples is low.
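The merge property the commenter is leaning on can be sketched with a toy
Bloom filter (my own illustrative code, not the commenter's implementation;
real filters use larger bit arrays and independent hash functions). The point
is that merging is a bitwise OR, which is commutative, associative, and
idempotent, so replicas can exchange state without locks:

```python
M = 64  # toy filter size in bits; real filters are much larger

def bf_add(bits, key):
    # Set two positions per key (toy hashing; a real implementation
    # would use independent hash functions).
    for seed in (0, 1):
        bits |= 1 << (hash((seed, key)) % M)
    return bits

def bf_merge(a, b):
    # Bitwise OR: commutative, associative, and idempotent, so updates
    # can arrive in any order and double-delivery is a no-op.
    return a | b

full = bf_add(bf_add(0, "click:42"), "click:7")
partial = bf_add(0, "click:42")  # a replica that saw only a subset

merged = bf_merge(full, partial)
assert merged == full                      # subset merge adds nothing
assert bf_merge(merged, merged) == merged  # idempotent
```

HyperLogLogs merge the same way (elementwise max over registers), which is
why the combination gives a lock-free distributed counter of trials and
successes.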

Don't use a library. Thompson sampling itself is not difficult (specifically,
six lines, going by my Clojure code), you will need to understand it very
well anyway, and most libraries will include lots of superfluous garbage.
Most of the work is not in Thompson sampling but in processing your events,
and that is application-specific anyway and can't be abstracted well.
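Not the commenter's Clojure, but a minimal sketch of Bernoulli Thompson
sampling in Python (names and the three-arm setup are illustrative); the
whole algorithm really is just "sample from each arm's Beta posterior, play
the argmax, update the winner's counts":

```python
import random

# Per-arm Beta posterior parameters [alpha, beta], starting from a
# uniform Beta(1, 1) prior.
arms = [[1, 1], [1, 1], [1, 1]]

def choose(arms):
    # Draw one plausible win rate from each arm's posterior and play
    # the arm whose draw is highest.
    samples = [random.betavariate(a, b) for a, b in arms]
    return max(range(len(arms)), key=samples.__getitem__)

def update(arms, arm, reward):
    # Conjugate update: a success bumps alpha, a failure bumps beta.
    arms[arm][0 if reward else 1] += 1

# Typical loop: arm = choose(arms); ... observe ...; update(arms, arm, won)
```

Uncertain arms produce high-variance draws and so still get picked
occasionally (exploration); confident good arms win the argmax most of the
time (exploitation). No tuning parameter is needed.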

~~~
Matumio
I agree. Out of curiosity, I did some research on multi-armed bandits in the
past. The optimal solution is quite involved and, in my opinion, not worth
the effort of studying.

Thompson sampling, on the other hand... I'm in love. Clean, simple,
straightforward, and so close to the optimal solution that you will never
notice the difference. If you don't understand how to implement it, the stuff that you
need to learn (mostly about the beta distribution) is generally useful and
well worth studying. Though, after several years now, I'm still looking for an
excuse to apply it somewhere. I can't get this method out of my head.

~~~
joshuamorton
I'm a bit confused; as far as I know, Thompson sampling is optimal (insofar
as it minimizes regret in expectation).
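For reference, the quantity being minimized is the expected total regret:
the gap between always playing the best arm and what the strategy actually
earned. In the usual notation (matching the linked post, where θ* is the
best arm's expected reward and r_t is the reward received at step t):

```latex
\mathbb{E}[\rho(T)] = T\,\theta^{*} - \mathbb{E}\!\left[\sum_{t=1}^{T} r_{t}\right]
```

"Optimal" here is usually meant asymptotically: the regret grows
logarithmically in T rather than linearly, which is the best achievable rate
for this problem.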

------
jerf
One of my touchstones for improvement in science being a real thing is that "I
will run a multi-armed bandit study with the following 'arms' for the bandit,
and the following procedure for developing more 'arms': ..." becomes an
acceptable study to run.

(There are some very degenerate cases that are allowed; medical trials can
fail and succeed so spectacularly that they can skip to the end of the given
phase, for instance, which is sort of a degenerate case of a multi-armed
bandit approach. But it's a rare exception.)

We've suffered a lot from the phrase "the scientific method" because of the
article "the". The name of the game is to increase and decrease confidence in
hypotheses by any legitimate statistical method, not to follow "the" method.
Science suffers immensely as a result of the extreme requirement to "call your
shots" in advance, which contributes to all sorts of problems. The current
system almost entirely forbids any sort of exploration; it is sort of snuck in
the side door, rather than officially supported.

(It's very analogous to the problems with waterfall design in our industry;
the absolute worst time to estimate how expensive a problem is, how likely
the solution is to solve it, how the solution will integrate with the rest of
the world, and all of the other interesting engineering issues is at the very
beginning of the project, when you know the least. One of the virtues of an
iterative process is your ability to react to what you learn; waterfall is
such a problem because you still inevitably learn these things but the process
forbids you from exploiting that knowledge.)

~~~
naasking
> Science suffers immensely as a result of the extreme requirement to "call
> your shots" in advance, which contributes to all sorts of problems.

Science is currently suffering a 50+% replication failure crisis as a result
of not following "the" method. Studies forced to preregister have shown a
dramatic reduction in false positives.

~~~
jerf
A non-trivial component of the replication failure is the _extreme_ incentives
to cheat, p-hacking and friends, because if you call your shot, and reality
says "you're wrong", the _real_ current system, the one that actually
determines who gets promotions and more grant money, says that scientist
fails.

Sure, the pretend system says this scientist has done noble and laudable work
by showing some hypothesis was wrong, but who cares what the pretend system
says? Certainly not anyone who wants promotions (or even just job security)
and grant money.

------
jedberg
If this reading is too dense and you want a quick primer on multi-armed
bandits:

Imagine four slot machines. Each time you play, you choose one with a weighted
probability, and the weight is how many times it won in the past.

So on your first play, you randomly select a machine. Now imagine that machine
wins. It now has a probability of 1 vs the other three with 0. So you play it
again. Now it loses. It now has a .5 probability, and the other ones have 0.
Your next play will be a 50% chance of the same machine, and 50% chance of one
of the others. You keep going this way and eventually you will pick the one
that wins the most often most of the time, but sometimes you'll pick one of
the others, just to make sure your probabilities are still correct.

How is this useful to a programmer? It's _great_ for A/B tests. For example,
you come up with four different home pages, and then define success as
clicking on the signup link. At first you randomly show a homepage to each
visitor, but as you progress, you'll get one homepage that converts better
than the others. You'll show that one most often, but sometimes show the other
ones to give them a chance to "catch up", just in case your initial traffic
happened to be biased.

~~~
Zbrush
Sorry, but I think rather than clearing things up, your post causes more
confusion. First of all, you're not giving a quick primer on the multi-armed
bandit _problem_, but an attempt at describing one specific _algorithm_ for
it. That distinction is important, because there are several algorithms with
different trade-offs, hence the linked post. Second, your description isn't
very coherent: probabilities have to add up to 1, and the probability of
picking a machine in step two can't simultaneously be both 0% and 50%.

FWIW, the article already contains a very accessible description of what the
problem is about: "Imagine you are in a casino facing multiple slot machines
and each is configured with an unknown probability of how likely you can get a
reward at one play. The question is: What is the best strategy to achieve
highest long-term rewards?"

------
tdgunes
For those who are interested in multi-armed bandits, you can have a look at
this paper for the budget limited version:
[https://eprints.soton.ac.uk/270806/1/LTT_AAAI2010_Bandit.pdf](https://eprints.soton.ac.uk/270806/1/LTT_AAAI2010_Bandit.pdf)

An extension of that paper:
[https://www.aaai.org/ocs/index.php/AAAI/AAAI12/paper/viewFil...](https://www.aaai.org/ocs/index.php/AAAI/AAAI12/paper/viewFile/4865/5532)

------
AndrewKemendo
Tangentially, why haven't we all standardized our mathematical notation yet?

This author uses K as the count variable bounding the tuple, yet n is used
more frequently in the literature for count variables over unbounded sets,
including tuples.

~~~
rwilson4
Differences in notation between different fields abound. It used to be (and
probably still is) that the math department didn't talk to the engineering
department, so people would reinvent basically the same solution in
different contexts. Now people maintain the notation for internal
consistency, instead of switching to a "universally standard notation".

Basically, historical consistency trumps cross-disciplinary consistency.

~~~
AndrewKemendo
I mean, in RL we use n consistently. It's all over the Sutton and Barto book.
I'm looking at it now.

