Maybe a better way to do the exploitation phase would be to sample among the choices with probability proportional to each one's expected payoff, instead of always picking the one with the maximum expected payoff.
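
To make the idea concrete, here is a rough sketch of the two exploitation rules side by side. The names `pick_proportional` and `pick_greedy` are made up for illustration, and the sketch assumes the payoff estimates are non-negative (a negative estimate would need shifting or some other transformation before it could serve as a weight).

```python
import random


def pick_greedy(expected_payoffs):
    """Return the index of the choice with the largest estimated payoff."""
    return max(range(len(expected_payoffs)), key=lambda i: expected_payoffs[i])


def pick_proportional(expected_payoffs, rng=random):
    """Return an index drawn with probability proportional to its estimated payoff.

    Assumption: every estimate is non-negative.
    """
    if any(p < 0 for p in expected_payoffs):
        raise ValueError("payoff estimates must be non-negative")
    if sum(expected_payoffs) == 0:
        # No information yet: fall back to a uniform draw.
        return rng.randrange(len(expected_payoffs))
    indices = range(len(expected_payoffs))
    return rng.choices(indices, weights=expected_payoffs, k=1)[0]


if __name__ == "__main__":
    estimates = [0.1, 0.3, 0.6]  # hypothetical payoff estimates
    rng = random.Random(0)
    counts = [0] * len(estimates)
    for _ in range(10_000):
        counts[pick_proportional(estimates, rng)] += 1
    print("greedy always picks:", pick_greedy(estimates))
    print("proportional pick counts:", counts)
```

With the proportional rule, the weaker choices still get tried now and then, so their estimates keep getting updated, whereas the greedy rule never revisits them during exploitation.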

