
Multi-Armed Bandits and the Stitch Fix Experimentation Platform - jonbaer
https://multithreaded.stitchfix.com/blog/2020/08/05/bandits/
======
meigetsu
I have mixed feelings about using multi-armed bandits for product testing like
this. Regret minimization makes complete sense as a framework if you are
testing a large inventory of things - i.e. the classic examples of showing ads
or recommendations - since there may be real opportunity cost in not showing
some of the things in inventory (particularly if the inventory has a shelf
life). (I'm also quite surprised they don't use Thompson sampling...)

For testing product features, though, I feel there is often a high long-term
cost to the dev team, and the regret from showing users a non-optimal
treatment during the experiment is pretty minimal (to first order, the regret
is usually just the cost of experimental bandwidth).

The team cost comes in several subtle forms:

\- in practice, bandits encourage lots of small experiments, which leave
behind a graveyard of code with a large surface area - you can mitigate this
by setting strict stopping points for bandit experiments

\- bandits have higher statistical power, but also a higher false-positive
rate; false positives can be quite costly, since they cause thrash and take
time to investigate when a feature that tested well does poorly in production

\- you are introducing novelty effects over time as new sample groups are
added under dynamic allocation; probably nbd for most experiments, but it's
complicated to correct for if your experiment is subject to novelty effects

\- there are often cyclical, time-dependent changes in the composition of
users being exposed (daytime vs. night, weekday vs. weekend, geography because
of timezone differences); again, probably nbd for most experiments, but it
requires complex stratification to correct for if it is an issue

I would also say that the majority of product changes have small but
measurable effects on metrics, so I'm not sure bandits help all that much in
those cases. If there are runaway successes or failures, early-stopping
techniques seem like a better way to free up resources - early-stopping
policies can be tuned to address the experiment-design problems above fairly
simply.

Again, this is all for product testing. I think for recommendations and
personalization, contextual bandits make lots of sense.
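
For concreteness, the dynamic allocation I'm talking about is roughly the
following - a minimal Beta-Bernoulli Thompson sampling loop. The arm names and
conversion rates here are made up for illustration, not taken from the
article:

```python
import random

random.seed(0)

# Hypothetical conversion rates for two treatments (invented for this sketch).
true_rates = {"A": 0.05, "B": 0.10}
alpha = {arm: 1 for arm in true_rates}  # Beta posterior: 1 + successes
beta = {arm: 1 for arm in true_rates}   # Beta posterior: 1 + failures

for _ in range(10000):
    # Sample a plausible rate for each arm from its posterior,
    # then serve the arm whose sampled rate is highest.
    sampled = {arm: random.betavariate(alpha[arm], beta[arm])
               for arm in true_rates}
    arm = max(sampled, key=sampled.get)
    reward = random.random() < true_rates[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward

# Traffic concentrates on the better arm as its posterior sharpens.
plays = {arm: alpha[arm] + beta[arm] - 2 for arm in true_rates}
print(plays)
```

The point is that allocation shifts toward the winner during the experiment -
which is exactly what makes the sample composition drift over time in the
ways described above.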

~~~
trumpeta
> I'm also quite surprised they don't use thompson sampling

Half of the article talks about how they use Thompson Sampling

~~~
meigetsu
huh, weird - I saw this post in Aug and can't understand how I missed that.
Thanks for pointing that out - it does indeed discuss it.

------
vii
Setting a good objective function is pretty hard. In the context of consumer
goods, it sits at the intersection of three difficult problems:

\- it is equivalent to incentivising salespeople, which is known to be very
difficult, as short-term incentives often oppose long-term ones

\- distinguishing and dealing with spammers, robots and crawlers

\- and setting up stable reinforcement-learning behaviour even in the short
term, which is tough even without the first two problems

For these reasons, business partners, designers, and others will naturally be
very curious about how the bandit affects the customer experience.

Many years ago, to solve this, I built a system that would emit a list of
(suboptimal) rules to exploit the opportunities learnt from small A/B test
groups (like an epsilon-greedy contextual bandit). These rules were reviewed
by the relevant stakeholders and then explicitly deployed to production as a
configuration change, which allowed for manual consideration of issues in the
three areas above that are hard to automate.
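
The pattern was roughly this - a toy epsilon-greedy sketch where the learnt
policy is emitted as explicit, reviewable rules. The segments, variants, and
rates are all invented here; this is not the original system:

```python
import random

random.seed(1)

# Hypothetical contexts, actions, and conversion rates (invented for this sketch).
segments = ["new_user", "returning_user"]
variants = ["layout_a", "layout_b"]
true_rate = {("new_user", "layout_a"): 0.03,
             ("new_user", "layout_b"): 0.09,
             ("returning_user", "layout_a"): 0.10,
             ("returning_user", "layout_b"): 0.04}

stats = {k: [0, 0] for k in true_rate}  # [successes, trials] per (segment, variant)
EPSILON = 0.1  # fraction of traffic in the small exploration group

for _ in range(20000):
    seg = random.choice(segments)
    if random.random() < EPSILON:
        var = random.choice(variants)  # small A/B-style exploration group
    else:
        # exploit: serve the best observed variant for this segment so far
        var = max(variants, key=lambda v: stats[(seg, v)][0] /
                                          max(stats[(seg, v)][1], 1))
    s = stats[(seg, var)]
    s[0] += random.random() < true_rate[(seg, var)]
    s[1] += 1

# Emit the learnt policy as explicit per-segment rules, which humans can
# review and then ship as a configuration change.
rules = {seg: max(variants, key=lambda v: stats[(seg, v)][0] /
                                          max(stats[(seg, v)][1], 1))
         for seg in segments}
print(rules)
```

The "emit rules" step is the part that makes the learnt behaviour inspectable
before it reaches production.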

~~~
alextheparrot
Producing a set of impactful decision boundaries as functions and then
manually curating those functions reminds me of how much work maintaining
rule-based systems can be. Moreover, so much time is spent figuring out which
rules might be helpful in the first place - this is partly what makes rule-
based systems traditionally brittle (it takes far longer to evolve the rule-
based system than to work around the rules).

I really like the idea of models producing functions over values, thanks for
sharing that insight.

------
sakjdlask
Stitch Fix blog posts are always very smart, with a lot of equations, but the
last time there was an article about the company on HN, the comments were
saying things like 'I have explicitly told them not to send me white shirts
and they keep doing it'.

I find it a bit paradoxical.

