If A smashes B, then bandit will converge in a very obvious manner on A, and you pick it and delete the B code branch, accepting future regret from the possibility that B was in fact better as a cost of doing business. If A doesn't smash B, then at an arbitrary point in the future you realize it has not converged on either A or B, pick one branch using a traditional method like "I kind of like A myself", delete the B code, and go on to something that will actually matter for the business rather than trying to minimize your regret function where both possible solutions are (provably) likely non-motivational amounts of money between each other.
Please feel free to correct me if I'm wrong on this -- I have a bad flu today and take only marginal responsibility for my actions.
However, you are mathematically wrong. Doing things the way you describe is another form of A/B testing, just one designed to reduce the regret incurred during the test. You still get the linear growth in regret I described (i.e., there is some % chance you picked the wrong branch, and you are leaving money on the table forever after).
Of course, your way is also economically correct, and the theoretical CS guys are wrong. They shouldn't be trying to minimize regret, they should be minimizing time discounted regret.
(Yes, I'm being extremely pedantic. I'm a mathematician, it comes with the territory.)
A guess at good strategy in those situations would be something that enables you to choose a good path from a small number of options and keep doing so - on the basis that the cumulative advantage will be far more important than any particular choice in itself. Sounds like A/B testing would win if you could actually map the complexity of the problem correctly.
I think that's a bad thing, though. A single conversion metric optimization is only good for some (large or medium) effect size. After a point, criterions like "I kind of like A" are much more important.
In A/B testing you see this when your ambitious testing campaign returns "insignificant". In MAB you see it when two choices run at roughly 50/50 enrollment for a long period of time.
So on these two extremes, I think practical use of A/B and MAB should be roughly identical. In the middle ground where A is usefully but not incredibly better than B, I feel they must differ.
Once you get your expected regret below the cost of maintaining the old code, rip out the old code!