I've been studying a handful of these bandit algorithms for a while now (with some technical expositions on my own blog [1, 2]), and my PhD advisor has done a lot of work on improving the theoretical guarantees for various settings.
But the more I read about bandit algorithms, the more I think they're not enough. In particular, they don't account for natural features of practical problems: similarity structure among the actions, correlations in rewards across rounds, and the ability to explore up front without having your regret measured against the results.
There are some new research directions that address each of these, though. People are now starting to impose metrics on the actions so that "closer" actions have more correlated rewards. There has also been a lot of work on "contextual bandits," in which every round comes paired with a context (e.g. browser type, information about past searches, etc.), and relationships among the contexts can be leveraged. And finally, there is a model of "pure exploration," in which you're given some fixed (but unknown) amount of time to prepare, and the recommender you come up with when that period ends is judged by how it performs from that point forward.
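To make the contextual idea concrete, here's a toy sketch of my own (not from any particular paper): a Bernoulli bandit that keeps a separate Beta posterior per (context, arm) pair and picks arms by Thompson Sampling. Real contextual bandit algorithms like LinUCB or EXP4 share structure across contexts rather than treating each one independently, but this shows the basic shape of the problem:

```python
import random

class ContextualThompson:
    """Toy contextual bandit: independent Beta posteriors per (context, arm)."""

    def __init__(self, arms):
        self.arms = arms
        self.stats = {}  # (context, arm) -> [successes, failures]

    def choose(self, context):
        # Thompson Sampling: draw from each arm's Beta(s+1, f+1) posterior
        # and play the arm with the highest draw.
        def sample(arm):
            s, f = self.stats.get((context, arm), [0, 0])
            return random.betavariate(s + 1, f + 1)
        return max(self.arms, key=sample)

    def update(self, context, arm, reward):
        s_f = self.stats.setdefault((context, arm), [0, 0])
        s_f[0 if reward else 1] += 1

# Usage: two contexts with opposite best arms (made-up rates).
bandit = ContextualThompson(arms=["A", "B"])
true_rates = {("mobile", "A"): 0.8, ("mobile", "B"): 0.2,
              ("desktop", "A"): 0.2, ("desktop", "B"): 0.8}
for _ in range(2000):
    ctx = random.choice(["mobile", "desktop"])
    arm = bandit.choose(ctx)
    bandit.update(ctx, arm, random.random() < true_rates[(ctx, arm)])
```

After a couple thousand rounds it learns a different best arm per context, which a context-blind bandit can't do by construction.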
I don't have a lot of intuition about which algorithms/settings are best for which situations, but if you're interested in this kind of thing you should check out the work of John Langford, Peter Auer, Sebastien Bubeck, and the many others in this field.
Nice post, Sergey. We've been using Thompson Sampling to deal with delayed feedback, since our data comes from mobile applications that are not always connected. The results have been pretty good. Here's a breakdown of how it works if you're interested: https://splitforce.com/resources/auto-optimization/
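For anyone curious what this looks like in code, here's a minimal sketch (my own, not Splitforce's actual implementation) of Bernoulli Thompson Sampling with delayed, batched feedback: decisions are served immediately from the current Beta posteriors, and rewards are folded in whenever a batch of feedback finally syncs from the client:

```python
import random

class DelayedThompson:
    """Bernoulli Thompson Sampling where feedback arrives in delayed batches."""

    def __init__(self, n_arms):
        self.successes = [0] * n_arms
        self.failures = [0] * n_arms

    def choose(self):
        # Draw one sample from each arm's Beta(s+1, f+1) posterior;
        # the posteriors may be stale while feedback is in flight.
        samples = [random.betavariate(s + 1, f + 1)
                   for s, f in zip(self.successes, self.failures)]
        return samples.index(max(samples))

    def update_batch(self, feedback):
        # feedback: iterable of (arm, reward) pairs, possibly long-delayed.
        for arm, reward in feedback:
            if reward:
                self.successes[arm] += 1
            else:
                self.failures[arm] += 1

# Simulate 50 sync cycles: 20 decisions served per cycle while "offline",
# with all feedback for the cycle arriving at once afterward.
bandit = DelayedThompson(n_arms=3)
rates = [0.1, 0.5, 0.9]  # made-up true conversion rates
for _ in range(50):
    batch = [bandit.choose() for _ in range(20)]
    bandit.update_batch((a, random.random() < rates[a]) for a in batch)
```

Because the posterior draws are random, the algorithm keeps exploring within a batch even though its estimates don't move until the sync, which is what makes Thompson Sampling a natural fit for intermittently connected clients.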
Have you thought about how to deal with changes in environmental factors over relatively longer periods of time? For example, seasonality or changes in popular taste.
[1]: http://jeremykun.com/2013/10/28/optimism-in-the-face-of-unce... [2]: http://jeremykun.com/2013/11/08/adversarial-bandits-and-the-...