I've been studying a handful of these bandit algorithms for a while now (with some technical expositions on my own blog [1, 2]), and my PhD advisor has done a lot of work on improving the theoretical guarantees for various settings.
But the more I read about bandit algorithms, the more I think they're not enough. In particular, they don't account for natural features of practical problems: similarity structure among the actions, correlations in rewards across rounds, and the ability to explore up front without having your regret measured against the results.
There are some new research directions that address each of these, though. People are now starting to impose metrics on the actions so that "closer" actions have more correlated rewards. There has also been a lot of work on "contextual bandits," in which every round comes paired with a context (e.g. browser type, information about past searches, etc.), and relationships among the contexts can be leveraged. And finally, there is a model of "pure exploration," in which you're given some fixed (but unknown) amount of time to prepare, and the recommender you come up with when that period ends is judged by how it performs from that point forward.
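To make the contextual idea concrete, here's a toy sketch of my own (not from any particular paper): a Bernoulli bandit that keeps a separate Beta posterior per (context, arm) pair and picks arms by Thompson Sampling. Real contextual bandit algorithms like LinUCB or EXP4 share structure across contexts rather than treating each one independently, but this shows the basic shape of the problem:

```python
import random

class ContextualThompson:
    """Toy contextual bandit: independent Beta posteriors per (context, arm)."""

    def __init__(self, arms):
        self.arms = arms
        self.stats = {}  # (context, arm) -> [successes, failures]

    def choose(self, context):
        # Thompson Sampling: draw from each arm's Beta(s+1, f+1) posterior
        # and play the arm with the highest draw.
        def sample(arm):
            s, f = self.stats.get((context, arm), [0, 0])
            return random.betavariate(s + 1, f + 1)
        return max(self.arms, key=sample)

    def update(self, context, arm, reward):
        s_f = self.stats.setdefault((context, arm), [0, 0])
        s_f[0 if reward else 1] += 1

# Usage: two contexts with opposite best arms (made-up rates).
bandit = ContextualThompson(arms=["A", "B"])
true_rates = {("mobile", "A"): 0.8, ("mobile", "B"): 0.2,
              ("desktop", "A"): 0.2, ("desktop", "B"): 0.8}
for _ in range(2000):
    ctx = random.choice(["mobile", "desktop"])
    arm = bandit.choose(ctx)
    bandit.update(ctx, arm, random.random() < true_rates[(ctx, arm)])
```

After a couple thousand rounds it learns a different best arm per context, which a context-blind bandit can't do by construction.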
I don't have a lot of intuition about which algorithms/settings are best for which situations, but if you're interested in this kind of thing you should check out the work of John Langford, Peter Auer, Sebastien Bubeck, and the many others in this field.
Nice post, Sergey. We've been using Thompson Sampling to deal with delayed feedback, since our data comes from mobile applications that are not always connected. The results have been pretty good. Here's a breakdown of how it works if you're interested: https://splitforce.com/resources/auto-optimization/
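For anyone curious what this looks like in code, here's a minimal sketch (my own, not Splitforce's actual implementation) of Bernoulli Thompson Sampling with delayed, batched feedback: decisions are served immediately from the current Beta posteriors, and rewards are folded in whenever a batch of feedback finally syncs from the client:

```python
import random

class DelayedThompson:
    """Bernoulli Thompson Sampling where feedback arrives in delayed batches."""

    def __init__(self, n_arms):
        self.successes = [0] * n_arms
        self.failures = [0] * n_arms

    def choose(self):
        # Draw one sample from each arm's Beta(s+1, f+1) posterior;
        # the posteriors may be stale while feedback is in flight.
        samples = [random.betavariate(s + 1, f + 1)
                   for s, f in zip(self.successes, self.failures)]
        return samples.index(max(samples))

    def update_batch(self, feedback):
        # feedback: iterable of (arm, reward) pairs, possibly long-delayed.
        for arm, reward in feedback:
            if reward:
                self.successes[arm] += 1
            else:
                self.failures[arm] += 1

# Simulate 50 sync cycles: 20 decisions served per cycle while "offline",
# with all feedback for the cycle arriving at once afterward.
bandit = DelayedThompson(n_arms=3)
rates = [0.1, 0.5, 0.9]  # made-up true conversion rates
for _ in range(50):
    batch = [bandit.choose() for _ in range(20)]
    bandit.update_batch((a, random.random() < rates[a]) for a in batch)
```

Because the posterior draws are random, the algorithm keeps exploring within a batch even though its estimates don't move until the sync, which is what makes Thompson Sampling a natural fit for intermittently connected clients.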
Have you thought about how to deal with changes in environmental factors over relatively longer periods of time? For example, seasonality or changes in popular taste.
[1]: http://jeremykun.com/2013/10/28/optimism-in-the-face-of-unce... [2]: http://jeremykun.com/2013/11/08/adversarial-bandits-and-the-...