However running experiments to prove that one 'process' is superior to another in a real world situation often involves more than running a chi square test on randomly generated data ;) (which is what I understood your code to be doing on a very brief glance at it- sorry if I got it wrong).
There is nothing wrong with starting your exploration with something like this, but it really isn't sufficient to make the claims in the blog post about specific ways in MAB is better/worse than A/B testing.
This seems to be methodologically dubious. I am not a stats expert, though I use it in my work, and I could be wrong but testing convergence to statistical significance doesn't seem to mean anything (mathematically/statistically)!.
I could be wrong - there are people with Stats PhD's here and I'll leave it to them to tell me if I am. But I've never heard of an experiment (in the formal sense) which ran till a significance checking algorithm crossed some tripwire level.
Is the sampling random? (especially for an MAB) What are the underlying distributions (and assumptions about them)?
In your article you (imo) either need to say something like "this test isn't really valid and so the conclusions shouldn't be taken very seriously" (which would be weird thing for a company with your product to blog!) or you should do some rigorous experiment design and be able to defend your results.
Right now it reads like some MBA type plugged in two black box formulae in an Excel sheet and drew conclusions from the results. (please note: I am not saying you did it that way -just saying that the article reads like not enough thought went into setting up a proper experiment)
(Statistical) Experiment Design is a lot of work and surprisingly hard to carry off well, and involves much subtlety - and theory! there are whole books written about it. Maybe time to buy a few? :)