
Reinforcement Learning – Bandit Problems - oneraynyday
https://oneraynyday.github.io/ml/2018/05/03/Reinforcement-Learning-Bandit/
======
subatomic
I was heavily into reinforcement learning around the turn of the century, and
at the time, "Reinforcement Learning: An Introduction" (Sutton and Barto)
[https://mitpress.mit.edu/books/reinforcement-learning](https://mitpress.mit.edu/books/reinforcement-learning)
was an absolute goldmine for me when getting started. I think parts of it are
online somewhere, including all their pseudocode and solutions.


~~~
svalorzen
The complete first edition can be found here:
[http://incompleteideas.net/book/ebook/the-book.html](http://incompleteideas.net/book/ebook/the-book.html)

If you're interested in some well-documented C++ implementations of the
algorithms shown in the book, feel free to check out
[https://github.com/Svalorzen/AI-Toolbox](https://github.com/Svalorzen/AI-Toolbox).
I started the project because when I was first reading the book I had no
reference implementation to compare against, and personally I learn better
with practical examples, so maybe it can help you too.

~~~
levesque
If you are going to start in RL, you should really consider reading the second
edition even though it is not released yet. I am guessing that Sutton is
getting close to the finish line, as there have been numerous revisions
already. The second edition has better notation and benefits from the field
having matured a lot since the first book was written.
[http://incompleteideas.net/book/the-book-2nd.html](http://incompleteideas.net/book/the-book-2nd.html)

------
sharonzhou
I'm surprised there was no octopus to represent the bandits.

In all seriousness, this post makes sense to me, as someone who does RL
research. However, the intuition behind the concepts could be communicated
more clearly, and I suspect the piece is less accessible to readers with
little prior knowledge of RL/bandits. Given that it's an introduction, I
presume that's your intended reader, though perhaps the writing can also be
for your own edification. Who's your audience?

~~~
oneraynyday
Thanks for the reply! My audience is someone early in their undergrad degree,
studying math or computer science. Which parts do you think are not
communicated clearly? I can make an edit later today :)

~~~
joshuamorton
Oh cool, I've also done a presentation on bandit methods for early undergrads
(see: [https://gtagency.github.io/2016/experimentation-with-no-ragrets](https://gtagency.github.io/2016/experimentation-with-no-ragrets);
it's missing speaker notes so it looks a bit strange, but it outlines the
structure fairly well). It was also sort of an intro to the beta distribution
and why you should love it, hence the focus on Thompson Sampling.

Some critiques:

\- I feel like your justification/explanation for why this is useful is a bit
lacking. Personally I find framing it in terms of regret-minimization better
than gain-maximization, even though in practice they're the same. It frames
the situation so that you go in knowing you will have to pick non-optimal
things sometimes, so your job is to learn the underlying distributions as
quickly as possible, instead of trying to pick the best things. Interestingly,
I think the gain-maximization view leads you down an epsilon-greedy path,
whereas regret-minimization leads you more naturally toward UCB1/Thompson
Sampling (see the sketch after this list). Since you pivot to RL instead of
just bandits, I can kind of understand it, but see my last point.

\- As a general rule, I try to minimize math in undergrad-focused
talks/documents. Even as someone who spends a lot of time explaining
statistical concepts to people, my eyes glaze over when I see
`q*(a) = E[R_t | A_t = a]`. Obviously you need some, and this is just a
personal thing. For the most part I actually think you do a decent job of
explaining the equations you use. At least until the gradient bandit part :P
Then it just feels like a textbook proof excerpt.

\- Nit: You don't fully explain that epsilon-greedy is greedy, except epsilon
of the time. That tripped me up for a second.

\- The last thing is that I feel like the motivation and difference between
stationary and nonstationary reward distributions isn't well explained.
Nonstationary rewards don't really "fit" the mental model behind k-armed
bandits a lot of the time. I'm actually curious for a better motivation there,
as I can't articulate one myself.
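
To make the contrast concrete, here's a minimal sketch (not from the article;
the arm probabilities and epsilon are made up) of plain epsilon-greedy on
Bernoulli arms, tracking cumulative regret on the side:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.7])   # hidden Bernoulli payout rate per arm
Q = np.zeros(3)                 # estimated action values
N = np.zeros(3)                 # pull counts
eps, regret = 0.1, 0.0

for t in range(10_000):
    # Greedy, except epsilon of the time (then explore uniformly).
    a = rng.integers(3) if rng.random() < eps else int(np.argmax(Q))
    r = float(rng.random() < p[a])      # Bernoulli reward
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]           # incremental sample-average update
    # Regret view: expected reward given up versus the best arm.
    regret += p.max() - p[a]

print("estimates:", Q.round(2), "cumulative regret:", round(regret, 1))
```

Gain-maximization only ever looks at `Q`; the regret view asks how fast
`regret` grows, which is the quantity UCB1/Thompson Sampling are designed to
keep small.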

~~~
oneraynyday
Hey Joshua, thanks so much for the criticism. I do see your point about
epsilon-greedy vs. regret minimization. I will add a section about that before
presenting UCB1. I will also add more information in the epsilon-greedy
strategy section, elaborating on what the epsilon is really for. I'm not 100%
sure how to reframe the nonstationary reward situation, because I feel like
that adds state depending on t to the bandit scenario, which then feels more
like an MDP.

~~~
joshuamorton
> because I feel like that adds state depending on t to the bandit scenario,
which then feels more like an MDP.

Yeah this was sort of exactly the issue I was running into. I can't justify it
to myself without essentially saying "this is just an MDP in disguise", which
maybe is the right way to do it. I'm pretty sure you can define a k-armed
bandit as an MDP on a single state, where each action corresponds to a
machine, and all actions return you to the single state.
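
To make that concrete, here's a tiny sketch (my framing, with illustrative
payout numbers) of a k-armed bandit written as the usual (S, A, P, R) tuple:

```python
import numpy as np

# A k-armed bandit as a one-state MDP: a single state, one action per
# machine, every action transitions back to that state with probability 1,
# and the reward depends only on the action.
k = 3
S = [0]                          # the single state
A = list(range(k))               # action a = "pull machine a"
P = np.ones((1, k, 1))           # P[s, a, s'] = 1 for every action
R = np.array([[0.2, 0.5, 0.7]])  # R[s, a] = expected payout of arm a
```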

So maybe that is the right motivation. But reversing that "an MDP is just a
k-armed bandit problem where sometimes playing a machine breaks it and forces
you to play other machines, which can impact how quickly the casino fixes your
first machine..." feels forced.

All that said, it's a good article :)

------
picomancer
It seems like any method for solving this problem could be interpreted in a
Bayesian way: At any time, you consider all the different possible
distributions each arm could have, and assign each a probability which is how
likely you think that distribution is to occur.

The probabilities are initialized to some value (the "prior"), then when you
pull the arm, you get some new information, which you use to update the
probabilities based on evidence.

It would be interesting to see whether you could analytically solve this
problem for a simple family of distributions. For example, assume each lever
produces Gaussian results, but has an unknown mean and SD. Set the prior to be
that the means are normally distributed with mean 0 and SD 1, and the SDs are
exponentially distributed with mean 1.

~~~
srean
Your intuition is spot on. With this line of argument you will end up with
Thompson Sampling [0]. TS is an old algorithm from the 1930s. It is
disconcertingly effective, even when the underlying Bayesian assumptions are
not correct. It is also extremely hard to analyze. This led to a weird state
of affairs - TS gave excellent empirical performance but we could not make any
theoretical statements about its efficacy. It is only very recently that this
has changed.

[0]
[https://en.wikipedia.org/wiki/Thompson_sampling](https://en.wikipedia.org/wiki/Thompson_sampling)
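
To make TS concrete for the Gaussian setup above, here's a minimal sketch
(not from the article; to keep the posterior update in closed form it assumes
each arm's noise SD is known and fixed at 1, rather than exponentially
distributed as proposed):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = rng.normal(0, 1, size=5)   # hidden mean payout of each arm
sigma = 1.0                             # known observation noise SD

# Posterior over each arm's mean is Normal(mu, tau2); start at the
# Normal(0, 1) prior suggested above.
mu = np.zeros(5)
tau2 = np.ones(5)

for t in range(2000):
    # Thompson step: sample a mean from each posterior, play the best.
    a = int(np.argmax(rng.normal(mu, np.sqrt(tau2))))
    r = rng.normal(true_means[a], sigma)
    # Conjugate Normal update for the played arm.
    precision = 1 / tau2[a] + 1 / sigma**2
    mu[a] = (mu[a] / tau2[a] + r / sigma**2) / precision
    tau2[a] = 1 / precision

print("true: ", true_means.round(2))
print("post: ", mu.round(2))
```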

------
crisbal_
As someone who is doing his bachelor's thesis on Reinforcement Learning, this
is some useful information.

OT (but not really) question: does anyone here use Reinforcement Learning
techniques at work? For the thesis I am working on black-box optimization of
two-variable functions with Reinforcement Learning (and comparing it with
Bayesian Optimization techniques).

As someone else suggested, the Sutton & Barto book is really great, but I
would also like to recommend these lectures
([https://www.youtube.com/watch?v=2pWv7GOvuf0](https://www.youtube.com/watch?v=2pWv7GOvuf0))
by David Silver (who worked on AlphaGo).

~~~
chudi
We used it to discover the best picture of a product in terms of click-through
rate. We also found that end users really don't want you to change the order
of the pictures of their products.

------
leongrandote
At the beginning of this post, the given definition of value doesn't seem
correct to me: I think it should be the expected value of the sum of all the
rewards from t to infinity, not of R_t alone, but I could be wrong. Citing
D. Silver: "The agent's job is to maximise cumulative rewards" (highlight
cumulative). I would suggest David Silver's course for reinforcement learning,
[http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html).
Another interesting post is [https://jeremykun.com/tag/bandit-learning/](https://jeremykun.com/tag/bandit-learning/)
and the five related posts that follow, comparing implementations of the
algorithms in Python.

I know I should not discourage people, but you should read the post "Deep
reinforcement learning doesn't work yet" before drinking all the Kool-Aid:
[https://www.alexirpan.com/2018/02/14/rl-hard.html](https://www.alexirpan.com/2018/02/14/rl-hard.html)

Edited many times.

~~~
svalorzen
You are right, it should be the expected value for the sum of all future
rewards.

At the same time, given that we're talking about bandits, it doesn't really
matter, since there is no state. The summation over time you'd like to see
doesn't change the relative ordering of the actions: the expected value in
your definition is simply the expected reward for the action multiplied by
the number of times you expect to play. So it doesn't really change anything.
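
Spelling that out in the post's notation (assuming you commit to a single arm
a for T plays): `E[Σ_{t=1..T} R_t] = Σ_{t=1..T} E[R_t | A_t = a] = T · q*(a)`,
so ranking arms by cumulative expected reward and ranking them by `q*(a)`
gives the same order.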

~~~
leongrandote
You are right. Anyway, I think that people playing with a bandit machine are
going to keep playing longer if they are getting a lot of money than if they
are losing money, so when people are involved in games there is a hidden
state: the mental state of the player. But if you decide up front the number
of steps and you don't change your strategy depending on your mood, then this
formal algorithm works as stated.

~~~
banditlover
Contextual bandits, on the other hand, allow you to put your mood in as
context (features), and your strategy depends on those features. You still
have that simple expected-reward maximization (instead of a brutally
hard-to-optimize loss), yet much more flexibility.
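
As a concrete example, here's a minimal LinUCB-style sketch (my choice of
algorithm, not the parent's; the dimensions and exploration constant are made
up). It assumes each arm's expected reward is linear in a context vector x,
where your mood would just be one of the features:

```python
import numpy as np

d, k, alpha = 3, 4, 1.0              # context dim, arms, exploration weight
A = [np.eye(d) for _ in range(k)]    # per-arm Gram matrix
b = [np.zeros(d) for _ in range(k)]  # per-arm reward-weighted context sum

def choose(x):
    # Play the arm with the highest upper confidence bound on x . theta.
    scores = []
    for a in range(k):
        A_inv = np.linalg.inv(A[a])
        theta = A_inv @ b[a]
        scores.append(x @ theta + alpha * np.sqrt(x @ A_inv @ x))
    return int(np.argmax(scores))

def update(a, x, r):
    # Standard LinUCB bookkeeping after observing reward r for arm a.
    A[a] += np.outer(x, x)
    b[a] += r * x
```

Calling `choose(x)` each round and `update(a, x, r)` after observing the
reward keeps the same pick-the-highest-expected-value structure as the plain
bandit, just conditioned on features.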

------
subroutine
This is interesting. It reminds me of the Rescorla-Wagner model (a model of
classical conditioning that's probably mentioned in every psych/neuro learning
and memory course textbook), which describes the trial-by-trial change in
associative strength between a conditioned stimulus (CS) and unconditioned
stimulus (US).

Though it has its shortcomings, R-W is quite elegant in its simplicity, while
doing a pretty good job modeling some fairly complex behavioral/cognitive
changes (and made an interesting prediction about 'blocking' that ended up
being true).
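
The connection is easy to see in code. A minimal sketch of the R-W trial
update (my own; the learning-rate values are made up):

```python
def rescorla_wagner(V, present, lam, alpha=0.1, beta=1.0):
    """One conditioning trial.

    V: dict mapping CS name -> associative strength
    present: set of CSs shown this trial
    lam: US magnitude (0 when the US is omitted)
    """
    # Shared prediction error: US magnitude minus the summed prediction
    # of all CSs present. This summed term is what produces blocking.
    delta = alpha * beta * (lam - sum(V[cs] for cs in present))
    for cs in present:
        V[cs] += delta
    return V
```

Because the error term uses the summed prediction of all CSs present, a fully
trained CS leaves almost no error for a newly added one, which is exactly the
blocking effect. It's the same error-driven incremental update as the value
estimates in the post.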

For more on this...

[http://campus.albion.edu/wjwilson/files/2012/03/RWSimplified...](http://campus.albion.edu/wjwilson/files/2012/03/RWSimplified.pdf)

[http://www.scholarpedia.org/article/Rescorla-
Wagner_model](http://www.scholarpedia.org/article/Rescorla-Wagner_model)

~~~
closed
RL can be seen as a generalization of the Rescorla-Wagner model!

[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4633133/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4633133/)

------
indescions_2018
Nice writeup! Why do we love bandit problems? Not only for the efficacy with
which agents can minimize exploration and maximize rewards in unknown worlds,
but also for the emergent philosophical truths that arise concerning
causality!

It remains very much an active research topic, with applications ranging from
epidemiology to website optimization ;)

CS7792 - Counterfactual Machine Learning, T. Joachims, Cornell University

[http://www.cs.cornell.edu/courses/cs7792/2016fa/](http://www.cs.cornell.edu/courses/cs7792/2016fa/)

Deep Bayesian Bandits Showdown - Google Brain

[https://arxiv.org/pdf/1802.09127.pdf](https://arxiv.org/pdf/1802.09127.pdf)

------
bagrow
Anyone aware of research on bandits with a growing set of arms? Meaning,
every so often when you play an arm, part of your reward is revealing a new
arm!

This seems not to fit the criteria anymore (not tabular, not Markov). Is it
related to structured bandits?

~~~
dhotson
Hmm, perhaps something like UCB1? The UCB (upper confidence bound) family of
algorithms might be appropriate; it's not quite what you've described, though.

They explore the arm with the highest potential payoff, i.e. the highest
upper confidence bound, which is often the arm you know the least about:
scores are roughly calculated as (average + confidence interval), and early
on the confidence interval is large. This style of algorithm means you can
add arms as you go through the experiment and they'll be explored/exploited
in a reasonable way.
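
For example, a minimal UCB1 sketch that tolerates arms appearing mid-run
(mine, not the parent's; `t` is the total number of pulls so far, >= 1):

```python
import math

counts, sums = [], []        # per-arm pull counts and reward totals

def add_arm():
    # New arms can appear at any point during the experiment.
    counts.append(0)
    sums.append(0.0)

def ucb1_choose(t):
    # Score = sample average + confidence radius; an unplayed arm gets
    # an infinite score, so every newly added arm is tried once.
    def score(a):
        if counts[a] == 0:
            return math.inf
        return sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
    return max(range(len(counts)), key=score)

def record(a, r):
    counts[a] += 1
    sums[a] += r
```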

What you're describing sounds more like you're exploring a frontier and
"discovering" new options along the way..?

------
QML
What's the difference in problem statement between k-armed bandits and the
experts problem?

------
mathinpens
Nice blog, but you should include citations/references, even if it is very
informal.

