The Low Base Rate Problem is when you have a binary outcome and one of the outcomes is rare (say, less than 1%). There is so little entropy in the information source that you have to acquire a heck of a lot of samples in order for the statistical test to have any power. The problem is not unique to frequentist statistics; it's a consequence of information theory and so it affects Bayesian statistics as well.
Nonetheless, I highly recommend examining Bayesian test techniques to avoid repeated significance testing (both within a single trial and across multiple trials). A side benefit is that when someone says "What's the probability that the new purple dragon logo outperforms the old one?", you can give them an answer without backpedaling and explaining null hypotheses, p-values, significance levels, and all that jazz.
The major drawback to Bayesian techniques is that they tend to be computationally expensive. For example, to evaluate the A/B test and answer the purple-dragon question with normal priors, you have to integrate a normal distribution in two directions, and there's not a clean analytic formula for that. That's why there's a jagged histogram in the blog post; it changes every time you hit "Calculate" because it's being integrated with Monte Carlo techniques, which take a lot of juice compared to (frequentist) analytic methods.
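For what it's worth, here is a minimal sketch of that kind of Monte Carlo calculation. It swaps the normal priors for uniform Beta(1, 1) priors on each conversion rate, which is my own simplifying assumption and not necessarily what the blog post does; the function name and the counts are made up for illustration.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=50_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # With a Beta(1, 1) prior, the posterior of each conversion rate
        # is Beta(1 + successes, 1 + failures).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws

# Hypothetical counts: 120 of 10,000 visitors converted on A, 150 of 10,000 on B.
print(prob_b_beats_a(120, 10_000, 150, 10_000))
```

Change the seed and the estimate wobbles a little, which is the same effect as the jagged histogram; more draws smooth it out at the cost of more compute.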
This is surprisingly valuable. As a former quant/math person, I am shocked time and again at how most self-professed data-driven people have no idea about inferential statistics. I've learned over time that most people just want to see data and descriptive statistics, preferably in visual representations, and interpret them somewhat creatively and pretty much non-rigorously.
These are not the same problems at all, and it's not clear how the authors propose to extend the tests they're informally describing to the test of equality of a parameter in two populations. It is possible in a Bayesian framework, but it's not this simple.
1. As pointed out correctly in the article and "Most A/B-Test Results are illusionary", using a non-sequential frequentist method in a sequential way leads to wrong conclusions. In this regard, the comparison is unfair. On the flip side, the frequentists' sequential methods for more than one dimension seem to be rather complicated (for one dimension and binomial distribution, check out the Sequential Probability Ratio Test (SPRT)).
2. Finding a good prior is really really hard. Choosing a weak prior may lead to jumping to wrong conclusions. I found it useful to
a) choose an uninformative prior
b) not perform any statistical analysis until 1-2 weeks of data are in (to be sure that all special-day effects like weekends are caught) ... which could be seen as a prior with weight of the same interval ;)
c) use a simple old-fashioned permutation test to control the alpha error and prevent jumping to conclusions
While this approach has turned out to be too conservative in the binomial case, it becomes extremely important when dealing with revenue. Single customers do turn the tide. Often these are clear outliers, but sometimes it is not that easy to say. But this is a topic for another day.
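For readers who haven't seen one, a permutation test along the lines of (c) can be sketched roughly like this. It is a generic two-sided test on the difference in means, not necessarily the exact variant used above; the function name and the revenue figures are hypothetical.

```python
import random

def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for the difference in means of a and b."""
    rng = random.Random(seed)
    observed = abs(sum(b) / len(b) - sum(a) / len(a))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_perm):
        # Shuffling the pooled values reassigns group labels at random,
        # which is what the null hypothesis of "no difference" allows.
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_b - mean_a) >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_perm + 1)

# Hypothetical revenue per visitor; one big spender sits in group B.
revenue_a = [0, 0, 0, 12, 0, 0, 9, 0, 0, 0]
revenue_b = [0, 0, 15, 0, 0, 0, 0, 400, 0, 0]
print(permutation_test(revenue_a, revenue_b, n_perm=2_000))
```

The same function works for conversions if each visitor is coded as 0 or 1, so it covers both the binomial case and the revenue case.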
Unfortunately, the requirement of priors (seemingly) precludes Bayesian stats from many A/B testing applications. For example, our site launches every article with 3 or 4 headline variations. A priori, we have no idea which headline will be most successful or even what the expected conversion rate is for that article.
Is there any way to use Bayesian methods in such a scenario?
In this case your prior mean could be something as simple as an average conversion rate for all your past articles.
Remember that by using frequentist statistics you are implicitly assuming a uniform prior. I think it's quite easy to find weakly informative priors which are more plausible than saying that any conversion rate is equally likely.
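As a rough sketch of that idea, assuming a Beta prior on the conversion rate (my choice, since it updates in closed form for binomial data): center the prior on the historical average and decide how many "pseudo-visits" it should be worth. The function names, the 2% average, and the weight of 100 are all hypothetical.

```python
def beta_prior_from_history(mean_rate, pseudo_visits):
    """Beta(alpha, beta) prior with the given mean, weighted like `pseudo_visits` observations."""
    alpha = mean_rate * pseudo_visits
    beta = (1 - mean_rate) * pseudo_visits
    return alpha, beta

def posterior_mean(alpha, beta, conversions, visits):
    """Posterior mean conversion rate after observing the headline's own data."""
    return (alpha + conversions) / (alpha + beta + visits)

# Hypothetical: past articles convert at 2% on average; the prior is worth 100 visits.
alpha, beta = beta_prior_from_history(0.02, 100)

# A new headline with 9 conversions in 200 visits gets pulled toward the 2% average.
print(posterior_mean(alpha, beta, conversions=9, visits=200))
```

A small `pseudo_visits` keeps the prior weak, so a headline's own data takes over quickly; a large value makes the estimate stick close to the historical average for longer.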
1. People want their formula to tell them things that are impossible to know objectively (like the probability that headline A is better than headline B).
2. They want methods that are easy to use (just plug in some data and press play).
3. And they care more about getting the job done than doing the job right (going from no number to a decent number is more important than going from a decent number to the best number).
I think the reason that frequentist statistics are popular is that they meet these three (misguided) desires. Frequentist statistics is full of numbers (like p values) that can easily be confused for the probabilities that people desire. People don't actually care that a p value of 0.05 does NOT mean that the null hypothesis has a 5% chance of being true. They just care that there's an easy objective number that justifies taking an action (like choosing one headline over another).
(Of course, this is all my personal speculation.)
They really shine in that they provide a uniform vocabulary for producing new, more sophisticated models, while "frequentist" methods usually rely on ingenuity to get to better models. So if you're actively exploring a model space and want to attach a bunch of assumptions and degrees of freedom to correspond with a theory you're testing... then Bayesianism is the way to go.
Theoretically, marketers are doing that exact process. In practice, they don't see that it is a statistical process, though. Maybe someday a tool will bridge that gap successfully... but again, you're unlikely to get a huge advantage with Bayesian methods without that increased work investment.