Stopping a test when you reach a "statistically significant" result is the wrong way to do A/B testing. In both multi-armed bandit and A/B testing you need to set ahead of time the number of users you are going to run your test against, and stop the test at that point regardless of whether your result is significant.
See http://elem.com/~btilly/effective-ab-testing/index.html#asli... for part of a presentation that I did where I actually set up some reasonable fake tests, and ran simulations. What I found is that if there is a large real difference, the probability of coming to the wrong conclusion was (as you would expect) higher, but not that high before the underlying difference made mistakes incredibly unlikely. Conversely, if there is only a small real difference, the amount of data needed before you have a significant chance of having accidentally come to an erroneous conclusion is very, very large.
So avoid accepting any result where you don't have at least a few hundred successes and set your thresholds reasonably high. You will make fewer mistakes than you probably fear, and the ones that you make will almost always be very minor. (Oops, I had a 3% chance of accepting the 1% worse solution as probably better.)
Of course if you're aiming to publish academic research, your standards need to be higher. But if you're more interested in getting useful results than publishable ones, you can relax your standards. A lot.
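To get a feel for the kind of simulation described above, here is a minimal sketch. It is not the code from the presentation; the function names, the pooled z-test, and the specific values (a threshold of 2.5, a minimum of 200 successes per arm, a check every 100 users) are all assumptions chosen for illustration.

```python
import math
import random

def z_score(succ_a, n_a, succ_b, n_b):
    """Pooled two-proportion z statistic; positive when B looks better."""
    p_pool = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (succ_b / n_b - succ_a / n_a) / se

def run_test(p_a, p_b, max_users, z_threshold, min_successes, check_every, rng):
    """Simulate one test with peeking. Returns 'A', 'B', or None (no decision)."""
    succ_a = succ_b = 0
    for n in range(1, max_users + 1):  # n = users seen so far in each arm
        succ_a += int(rng.random() < p_a)
        succ_b += int(rng.random() < p_b)
        # Only peek once both arms have enough successes (assumed rule of thumb).
        if n % check_every == 0 and min(succ_a, succ_b) >= min_successes:
            z = z_score(succ_a, n, succ_b, n)
            if abs(z) >= z_threshold:
                return "B" if z > 0 else "A"
    return None

def wrong_rate(p_a, p_b, trials, seed=0, **kwargs):
    """Fraction of simulated tests that stop early and pick the worse arm."""
    rng = random.Random(seed)
    better = "B" if p_b > p_a else "A"
    results = [run_test(p_a, p_b, rng=rng, **kwargs) for _ in range(trials)]
    return sum(1 for r in results if r is not None and r != better) / trials

if __name__ == "__main__":
    settings = dict(max_users=5000, z_threshold=2.5, min_successes=200, check_every=100)
    print("large difference:", wrong_rate(0.10, 0.15, trials=200, **settings))
    print("small difference:", wrong_rate(0.100, 0.101, trials=200, **settings))
```

Raising `z_threshold` or `min_successes` trades longer tests for fewer wrong calls, which is the knob the advice above is about.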
Nobody said that it was. But when you do regular split testing, you can use power analysis to estimate the length of time you need to run an experiment to get a significant result at a certain precision.
You can't do this (at least, not easily) when you're using bandit models, because the share of traffic each arm receives changes as the data comes in, so none of the fixed-sample assumptions are valid.
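For the regular split-testing case, the power analysis mentioned above can be sketched roughly as follows. This uses the standard normal-approximation formula for a two-sided comparison of two proportions; the function names, the default `alpha` and `power`, and the traffic figure in the example are assumptions for illustration, not anything from the thread.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_base, p_variant, alpha=0.05, power=0.8):
    """Approximate users needed in each arm to detect p_base vs p_variant."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p_base + p_variant) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_base * (1 - p_base) + p_variant * (1 - p_variant))
    )
    return math.ceil(numerator ** 2 / (p_base - p_variant) ** 2)

def days_needed(n_per_arm, daily_users, arms=2):
    """Days to fill every arm, assuming traffic is split evenly between arms."""
    return math.ceil(n_per_arm * arms / daily_users)

if __name__ == "__main__":
    n = sample_size_per_arm(0.10, 0.12)  # hypothetical 10% baseline vs 12% variant
    print(n, "users per arm,", days_needed(n, daily_users=1000), "days at 1000 users/day")
```

The even split assumed in `days_needed` is exactly what a bandit does not give you, which is why the same calculation does not carry over directly.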