A very common example: you might have a form that converts better on weekends than on weekdays because of some external factor. You fire up a test with this "multi-armed bandit" method on Friday morning with two variations that are exactly equal. By random chance, variation A is winning by the end of Friday, so it gets shifted to 90% of impressions. Because it is now the weekend, which naturally increases conversions, variation A's conversion rate is going to pull ahead of variation B's, maybe even significantly. In fact, it isn't better at all.
Now this may be uncommon, but think about this: Your website is changing all the time. If you happen to make a change on your website that affects conversion rates and you do not keep consistent bucket assignment ratios, then any ongoing experiment is going to be tainted.
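One way to keep the assignment ratio fixed for the life of a test is to bucket visitors deterministically from a hash instead of from running results. This is only a minimal sketch: the function name `assign_bucket`, the experiment key, and the 50/50 weights are all made up for illustration and don't come from any particular testing tool.

```python
import hashlib

def assign_bucket(user_id, experiment, weights):
    """Map a visitor to a variation using fixed weights (hypothetical helper).

    The same (experiment, user_id) pair always lands in the same bucket,
    and the split never shifts in response to observed conversions.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    total = sum(weights.values())
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return name
    return name  # guard against floating-point rounding at the top edge

# Illustrative usage with an assumed 50/50 split.
weights = {"A": 1, "B": 1}
print(assign_bucket("visitor-123", "signup-form", weights))
```

Because the split does not move, a weekend bump or a site change hits both variations in the same proportion.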
I appreciate that the "multi-armed bandit" method increases the conversion rate on your site during the test, but to be honest, when I am running a test I'm not worried about increasing the conversion rate during the test. I'm worried about increasing conversion rates over the long run, and the best way to do that is to get accurate and significant results from my test as quickly as possible.
This also debunks one of the advantages in the blog post, which says you can add variations at any time. You cannot add variations at any time, because if you do you are comparing apples to oranges. The conversion rates from time period A-->C cannot be assumed to be the same as B-->C, because there have likely been changes between A and B that could affect the "base" conversion rates (website changes, external factors like weekend vs weekday, etc). When you add a new variation, you are creating a new A/B test that must be analyzed independently.
However, I have been thinking about it since, and it is possible to design a multi-armed bandit approach with logarithmic regret (though higher by a constant factor than a traditional approach) that can handle the varying performance ratio. It would also allow you to add variations at any time.
There remain operational differences, but this problem is fixable.
Here is the absurd case. The conversion rate is 10% on Friday and 50% on Saturday:
Friday (~10% conversion):
A: 10 / 100
B: 11 / 100
Saturday (~50% conversion):
A: 5 / 10
B: 45 / 90
Combined:
A: (10 + 5) / (100 + 10) = 13.6%
B: (11 + 45) / (100 + 90) = 29.5%
This is 99.9% statistically significant even though both variations are exactly the same: http://www.thumbtack.com/labs/abba/#A=15%2C110&B=56%2C19...
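To check the arithmetic, here is a rough sketch that pools the numbers above and then tests each day separately. It uses a plain pooled two-proportion z-test, which is my assumption and not necessarily what the linked calculator does, so the exact confidence figure can differ. The helper names are made up for illustration.

```python
import math

# (conversions, impressions) per day and variation, copied from the example above.
data = {
    "Friday":   {"A": (10, 100), "B": (11, 100)},
    "Saturday": {"A": (5, 10),   "B": (45, 90)},
}

def pooled(variation):
    """Sum conversions and impressions for one variation across all days."""
    conv = sum(day[variation][0] for day in data.values())
    imps = sum(day[variation][1] for day in data.values())
    return conv, imps

def two_sided_p(c1, n1, c2, n2):
    """Two-sided p-value from a pooled two-proportion z-test (assumed method)."""
    p = (c1 + c2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (c2 / n2 - c1 / n1) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Naive analysis: lump both days together.
print("pooled p-value:", two_sided_p(*pooled("A"), *pooled("B")))

# Stratified analysis: compare A and B within each day.
for day, arms in data.items():
    print(day, "p-value:", two_sided_p(*arms["A"], *arms["B"]))
```

The pooled comparison looks strongly significant, while neither day on its own shows any real difference. The whole gap comes from B getting most of its impressions on the high-converting day.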
Here are some things I've found:
1) Absolute conversion rate. After a certain point, whatever you add will just detract from the performance of something else. That detraction could come either from the landing page itself (if you're lucky) or from some longer-term variable (hope those "conversions" don't cost you too much money). I have had both occur.
2) "Statistically significant" can just be noise when variations are fairly close to each other. After getting rid of the obvious losers, I've watched the "winner" of elements like button color change back and forth daily for weeks, with no clear winner, even with 30,000+ conversions a day flowing through. This is the kind of thing visualwebsiteoptimizer would write a case study on after one hour's worth of traffic and declare a winner.
3) Your brand can get shit on when dealing with returning users. They are used to seeing one thing and now they see something else. Imagine if the colors, theme, and button locations changed every day (or to be fair, once a week) when you visited HN. Often "better converting" designs actually convert worse when introduced to existing customers.
4) Failure to account for external variables, especially when dealing with small sample sizes. Testing is often done most vigorously with paid traffic sources, as the monetary incentive is direct. The traffic source itself is often a bigger determining factor behind the conversion rate than the design. With a small budget or sample size, you could end up with some pretty poor test results that the math will tell you are correct.
I am not saying don't test. I am saying a/b testing, split testing, multivariate testing, etc. are abused.