I wouldn't sweat too much UCB methods vs e-greedy or other heuristics for balancing explore/exploit. E-greedy (and e-greedy decreasing) is nice because it is simple. Softmax/Boltzman, is interesting since it is satisfying in that it selects arms weighted by the estimated means, and UCB-Tuned and UCB-Normal are nice because, like AB testing, they take variance measures directly into account when selecting an arm.
Take a look at this paper from Doina Precup (who is super nice BTW) and Volodymyr Kuleshov from 2000 http://www.cs.mcgill.ca/~vkules/bandits.pdf they have comparisons between various methods. Guess what - the simple methods work just fine.
Of course there are various Bayesion versions - esp of UCB. Nando de Freitas over at UBC has a recent ICML paper on using the Gaussian Process for Bandits (based on a form of UCB). See http://www.cs.ubc.ca/~nando/papers/BayesBandits.pdf
I have not given it a tight read, but not sure what the practical return would be. Plus you have to fiddle with picking a Kernel function, and I imagine length scales and the rest of hyper parameters associated with GPs.
I did read a working paper from Nando a few years back that used a random forest as a prior - I can't seem to find it now.
BTW - John Langford is program chair of this year’s ICML over in Edinburgh. If you are in the UK might be worth it to pop up and attend. Plus Chris Williams is there at Edinburgh, so maybe you can corner him about GPs. Although he has moved on from GPs - he still wrote (well, co-wrote) the book and is one of the smartest people I have ever met.