From an EE's point of view, what the OP is saying is that there's too much noise in the reward signal the multi-armed bandit algorithm sees. If that's the case, a low-pass loop filter is needed. It's kind of like locking a PLL to a noisy reference clock, right?
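
To make the analogy concrete, here's a rough sketch of what "low-pass filtering" a bandit's reward estimate could look like: a first-order IIR filter (an exponential moving average) per arm. The names `ema_update`, `filtered_estimates`, and the `alpha` smoothing factor are my own for illustration, not anything from the OP's setup; a smaller `alpha` plays the role of a narrower loop bandwidth.

```python
def ema_update(estimate, reward, alpha=0.1):
    """One step of a first-order IIR low-pass filter (exponential moving average).

    alpha is a hypothetical smoothing factor in (0, 1]: smaller values
    reject more noise but react more slowly to real changes.
    """
    return estimate + alpha * (reward - estimate)


def filtered_estimates(rewards_per_arm, alpha=0.1):
    """Return a smoothed reward estimate for each arm.

    rewards_per_arm maps an arm name to its non-empty list of observed
    rewards, in the order they were seen.
    """
    estimates = {}
    for arm, rewards in rewards_per_arm.items():
        est = rewards[0]  # seed the filter with the first observation
        for r in rewards[1:]:
            est = ema_update(est, r, alpha)
        estimates[arm] = est
    return estimates


if __name__ == "__main__":
    observed = {"A": [0.0, 1.0, 0.0, 1.0], "B": [1.0, 1.0, 1.0]}
    print(filtered_estimates(observed, alpha=0.5))
```

The tradeoff is the same as in a PLL: narrow the bandwidth and you get less jitter but slower lock, so the bandit takes longer to notice when an arm's true payoff actually changes.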

