

Ask HN: Writing my first multi-armed bandit solution..? - darkxanthos

Any good sources I can go to, or a book I can buy, that delve into this
conceptually enough that I could prototype something? It's related to split
and A/B testing and web analytics.
======
lylejohnson
From Wikipedia, in case this term was as unfamiliar to others as it was to me:

"In statistics, particularly in the design of sequential experiments, a multi-
armed bandit takes its name from a traditional slot machine (one-armed
bandit). Multiple levers are considered in the motivating applications in
statistics. When pulled, each lever provides a reward drawn from a
distribution associated with that specific lever. The objective of the gambler
is to maximize the sum of rewards earned through a sequence of lever pulls.

In practice, multi-armed bandits have been used to model the problem of
managing research projects in a large organization, like a science foundation
or a pharmaceutical company. Given its fixed budget, the problem is to
allocate resources among the competing projects, whose properties are only
partially known now but may be better understood as time passes.

In the early versions of the multi-armed bandit problem, the gambler has no
initial knowledge about the levers. The crucial tradeoff the gambler faces at
each trial is between "exploitation" of the lever that has the highest
expected payoff and "exploration" to get more information about the expected
payoffs of the other levers."

<http://en.wikipedia.org/wiki/Multi-armed_bandit>
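
To make the setup concrete, here is a minimal Python sketch of a simulated
Bernoulli bandit; the class name and payoff probabilities are made-up
illustrations, not anything from the article:

    import random

    class BernoulliBandit:
        """Each lever pays 1 with a fixed hidden probability, else 0."""
        def __init__(self, probs):
            # true payoff probability per lever, unknown to the gambler
            self.probs = probs

        def pull(self, lever):
            return 1 if random.random() < self.probs[lever] else 0

    # Three levers; the gambler's job is to discover that lever 2 pays
    # best while wasting as few pulls as possible on the others.
    bandit = BernoulliBandit([0.3, 0.5, 0.7])
    print(bandit.pull(0))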

------
svedlin
Here is one good overview:

<http://www.cs.nyu.edu/~mohri/pub/bandit.pdf>

"[...] ε-greedy is probably the simplest and the most widely used strategy to
solve the bandit problem and was first described by Watkins [24]. The
ε-greedy strategy consists of choosing a random lever with ε-frequency, and
otherwise choosing the lever with the highest estimated mean, the estimation
being based on the rewards observed thus far. ε must be in the open interval
(0, 1) and its choice is left to the user. Methods that imply a binary
distinction between exploitation (the greedy choice) and exploration (uniform
probability over a set of levers) are known as semi-uniform methods."
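
A rough, self-contained Python sketch of the ε-greedy strategy described
above (the reward probabilities and function names are illustrative
assumptions, not taken from the paper):

    import random

    TRUE_PROBS = [0.3, 0.5, 0.7]  # hidden payoff rates; illustrative only

    def pull(arm):
        # stand-in for observing one reward, e.g. a conversion in an A/B test
        return 1 if random.random() < TRUE_PROBS[arm] else 0

    def epsilon_greedy(n_arms, epsilon=0.1, n_trials=10000):
        counts = [0] * n_arms   # pulls of each lever so far
        means = [0.0] * n_arms  # estimated mean reward of each lever
        for _ in range(n_trials):
            if random.random() < epsilon:
                arm = random.randrange(n_arms)  # explore: uniform random lever
            else:
                # exploit: lever with the highest estimated mean
                arm = max(range(n_arms), key=means.__getitem__)
            reward = pull(arm)
            counts[arm] += 1
            means[arm] += (reward - means[arm]) / counts[arm]  # running mean
        return means

    print(epsilon_greedy(3))  # estimates should approach [0.3, 0.5, 0.7]

In an A/B-testing setting each lever is a page variant and the reward is a
conversion, so ε-greedy sends a small fraction ε of traffic to random
variants and the rest to the current best-looking one.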

------
verdatel
The theory of multi-armed bandits arises from the study of Markov decision
processes (MDPs), a field closely tied to stochastic control and dynamic
programming. A good book on MDPs is "Markov Decision Processes: Discrete
Stochastic Dynamic Programming" by Martin Puterman; it's fairly mathematical
and pitched at the graduate level. Another superb book is "Dynamic
Programming and Optimal Control" by Dimitri Bertsekas.

A more accessible book is "Approximate Dynamic Programming" by Warren Powell.

I'm not sure if you know what a Markov chain is. Very roughly, it is a
process that jumps between values (usually called "states"), where the next
state is drawn from a probability distribution that depends only on the
current state. When several Markov chains run in parallel, each contributing
some reward, and we face the problem of which chain to "control" or advance
at each step, it becomes a multi-armed bandit problem. These problems are not
easy to solve except under certain simplifying assumptions. You can get a
good start by looking at some MATLAB code from Prof. Kevin Murphy at UBC:
<http://www.cs.ubc.ca/~murphyk/Software/MDP/mdp.html>
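
For intuition, a Markov chain takes only a few lines of Python to simulate;
the two-state transition matrix below is a made-up example:

    import random

    # Rows are the current state; each row is a probability
    # distribution over the next state.
    P = [[0.9, 0.1],
         [0.4, 0.6]]

    def step(state):
        # Jump to the next state using row `state` of P.
        return 0 if random.random() < P[state][0] else 1

    state = 0
    for _ in range(20):
        state = step(state)
        print(state, end=" ")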

