Your simulation is a fiction. It only informs you of the behavior of the code you wrote. To connect statistical significance to reality requires honoring criteria you have not met.
Applying frequentest evaluation concepts to Bayesian and Markov learning methods will lead to mistaken conclusions because they follow from a different set of axioms. They both lead to numbers, but these numbers are not interchangeable and you must remember their qualifications when applying them to predicting reality.
In more frequentest terms, multi-arm bandit is searching for minimum total regret via a random walk, not rapid convergence to a significance predicate. It can do this in the face of changing conditions and weaker assumptions concerning control than frequentest significance requires, which is why such methods are now the norm in machine learning.
There are huge opportunities that you will not see with your current perspective. If you do not learn them, your competitors will.
I do not know of a particular reference for multi-arm bandit algorithms in specific, but they are a specialized case of Markov model learning described with vocabulary that predates a more modern broad and uniform understanding of these topics.
David Barber's recent book is very good.
If you want a broad understanding of common mistakes due to unstated metaphysical assumptions concerning statistics, experiment and action, read Judea Pearl's 2nd edition.