This is a good discussion, but the author is confused about what the "Low Base Rate Problem" is. It doesn't have anything to do with the null hypothesis being true most of the time -- the example the author gives is actually a second form of repeated significance testing, which could be addressed with a Bonferroni or Šidák correction.
The Low Base Rate Problem is when you have a binary outcome and one of the outcomes is rare (say, less than 1%). There is so little entropy in the information source that you have to acquire a heck of a lot of samples in order for the statistical test to have any power. The problem is not unique to frequentist statistics; it's a consequence of information theory and so it affects Bayesian statistics as well.
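To make that concrete, here's a rough back-of-the-envelope power calculation for a two-proportion z-test. The 1% baseline and 10% relative lift are made-up numbers, and the normal approximation is only a sketch:

    from scipy.stats import norm

    def samples_per_arm(p_base, p_variant, alpha=0.05, power=0.8):
        """Approximate samples needed per arm to detect p_base -> p_variant
        with a two-sided two-proportion z-test (normal approximation)."""
        z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the test
        z_beta = norm.ppf(power)            # quantile for the desired power
        variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
        return (z_alpha + z_beta) ** 2 * variance / (p_base - p_variant) ** 2

    # A 1% base rate with a 10% relative lift (1.0% -> 1.1%), made-up numbers:
    print(round(samples_per_arm(0.010, 0.011)))   # roughly 160k per arm

    # The same relative lift on a 50% base rate needs far fewer samples:
    print(round(samples_per_arm(0.50, 0.55)))     # roughly 1.6k per arm

The rare outcome drives the denominator to something tiny while the per-sample information barely grows, which is why the required sample size explodes.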
Nonetheless, I highly recommend examining Bayesian test techniques to avoid repeated significance testing (both within a single trial and across multiple trials). A side benefit is that when someone says "What's the probability that the new purple dragon logo outperforms the old one?", you can give them an answer without backpedaling and explaining null hypotheses, p-values, significance levels, and all that jazz.
The major drawback to Bayesian techniques is that they tend to be computationally expensive. For example, to evaluate the A/B test and answer the purple-dragon question with normal priors, you have to integrate a normal distribution in two directions, and there's not a clean analytic formula for that. That's why there's a jagged histogram in the blog post; it changes every time you hit "Calculate" because it's being integrated with Monte Carlo techniques, which take a lot of juice compared to (frequentist) analytic methods.
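As a rough sketch of what that Monte Carlo integration looks like (the posterior means and standard deviations below are invented, not taken from the post):

    import numpy as np

    rng = np.random.default_rng()

    # Hypothetical normal posteriors over the two conversion rates
    # (the means and standard deviations are made up for illustration).
    samples_a = rng.normal(loc=0.050, scale=0.004, size=100_000)  # old logo
    samples_b = rng.normal(loc=0.053, scale=0.004, size=100_000)  # purple dragon

    # P(purple dragon outperforms the old logo), estimated by sampling.
    # Rerunning this gives a slightly different number each time, which is
    # why the histogram in the post jiggles on every "Calculate".
    print((samples_b > samples_a).mean())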
>A side benefit is that when someone says "What's the probability that the new purple dragon logo outperforms the old one?", you can give them an answer without backpedaling and explaining null hypotheses, p-values, significance levels, and all that jazz.
This is surprisingly valuable. As a former quant/math person, I am shocked time and again at how most self-professed data-driven people have no idea about inferential statistics. I've learned over time that most people just want to see data and descriptive statistics, preferably in visual representations, and interpret them somewhat creatively and pretty much non-rigorously.
Thanks for the comment! What we were trying to get at is running repeated experiments when the prior probability of any experiment being successful is low, which, as you correctly point out, is also about repeated testing (and has nothing to do with power). So perhaps our naming is unfortunate.
I think using Bayesian methods is a good thing. The reservation I have about this post is that it's motivated as A/B testing, which in classical frequentist testing usually equates to a null hypothesis that some parameter is equal in two populations (in this example, conversion rate under method A is equal to conversion rate under method B). The rest of the post then describes a test of a single population against a known value (is the conversion rate under method B = 5%).
These are not the same problems at all, and it's not clear how the authors propose to extend the tests they're informally describing to the test of equality of a parameter in two populations. It is possible in a Bayesian framework, but it's not this simple.
Thanks for the comment. We do look at comparing proportions in two samples (have a look at our calculator: http://developers.lyst.com/bayesian-calculator/). The way we do this is by constructing a posterior distribution of the difference between the two proportions: we take samples from each proportion's posterior distribution and compute their differences. But yes, we should have made it more obvious in the post!
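For anyone curious, a minimal sketch of that sampling approach, assuming Beta(1, 1) priors and made-up counts (the calculator's actual priors and inputs may differ):

    import numpy as np

    rng = np.random.default_rng()

    # Made-up observed data: conversions / trials for each variant.
    conv_a, n_a = 120, 2400
    conv_b, n_b = 145, 2500

    # Beta(1, 1) prior + binomial likelihood gives a Beta posterior per rate.
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

    # Posterior over the difference, built by pairing draws and subtracting.
    diff = post_b - post_a

    print("P(B > A):", (diff > 0).mean())
    print("95% credible interval for the lift:",
          np.percentile(diff, [2.5, 97.5]))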
1. As correctly pointed out in the article and in "Most A/B-Test Results Are Illusory", using a non-sequential frequentist method in a sequential way leads to wrong conclusions. In this regard, the comparison is unfair. On the flip side, frequentist sequential methods for more than one dimension seem to be rather complicated (for one dimension and a binomial distribution, check out the Sequential Probability Ratio Test (SPRT)).
2. Finding a good prior is really, really hard. Choosing a weak prior may lead to jumping to wrong conclusions. I've found it useful to
a) choose an uninformative prior,
b) not perform a statistical analysis until 1-2 weeks of data are in (to be sure that all special-day effects like weekends are caught) ... which could itself be seen as a prior with the weight of that interval ;)
c) use a simple old-fashioned permutation test to control the alpha error and prevent jumping to conclusions (sketched at the end of this comment).
While this approach has turned out to be too conservative in the binomial case, it becomes extremely important when dealing with revenue. Single customers do turn the tide. Often these are clear outliers, but sometimes it is not that easy to say. But this is a topic for another day.
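For reference, a minimal sketch of the permutation test mentioned in 2c, run on simulated heavy-tailed revenue data (all numbers here are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    def permutation_p_value(a, b, n_permutations=10_000):
        """Two-sided permutation test for a difference in means."""
        observed = abs(a.mean() - b.mean())
        pooled = np.concatenate([a, b])
        count = 0
        for _ in range(n_permutations):
            shuffled = rng.permutation(pooled)
            perm_a, perm_b = shuffled[:len(a)], shuffled[len(a):]
            if abs(perm_a.mean() - perm_b.mean()) >= observed:
                count += 1
        return count / n_permutations

    # Simulated per-user revenue with a heavy tail, so a single big spender
    # can "turn the tide".
    revenue_a = rng.exponential(scale=5.0, size=1000)
    revenue_b = rng.exponential(scale=5.5, size=1000)

    print(permutation_p_value(revenue_a, revenue_b))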
I had the pleasure of taking a great Bayesian stats class in college and these methods seem preferable.
Unfortunately, the requirement of priors (seemingly) precludes Bayesian stats from many A/B testing applications. For example, our site launches every article with 3 or 4 headline variations. A priori, we have no idea which headline will be most successful or even what the expected conversion rate is for that article.
Is there any way to use Bayesian methods in such a scenario?
You could use identical priors for each variant, with the mean equal to the empirical mean of your existing system, and the variance controlling how plausible big improvements are.
In this case your prior mean could be something as simple as an average conversion rate for all your past articles.
Remember that by using frequentist statistics you are implicitly assuming a uniform prior. I think it's quite easy to find weakly informative priors that are more plausible than saying that any conversion rate is equally likely.
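As a concrete (hypothetical) sketch: a Beta prior whose mean matches a sitewide historical conversion rate, with a "prior strength" that controls how plausible big improvements are. The 4% rate, strength of 50, and click counts below are invented numbers:

    from scipy.stats import beta

    # Hypothetical sitewide historical conversion rate and prior strength
    # (both numbers are made up for illustration).
    historical_rate = 0.04
    prior_strength = 50          # acts like 50 "pseudo-observations"

    a0 = historical_rate * prior_strength          # prior successes
    b0 = (1 - historical_rate) * prior_strength    # prior failures

    # Every headline variant starts with the same Beta(a0, b0) prior,
    # then gets updated with its own clicks / impressions.
    clicks, impressions = 30, 600
    posterior = beta(a0 + clicks, b0 + impressions - clicks)

    print("posterior mean:", posterior.mean())
    print("95% credible interval:", posterior.interval(0.95))

A larger prior_strength makes the prior harder to move, so huge claimed lifts from small samples get shrunk toward the historical average.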
I guess this is as good a place as any to ask a more general (and naive) question: how come Bayesian methods still haven't taken over the testing market, given their (as perceived by my narrow understanding) advantages?
1. People want their formula to tell them things that are impossible to know objectively (like the probability that headline A is better than headline B).
2. They want methods that are easy to use (just plug in some data and press play).
3. And they care more about getting the job done than doing the job right (going from no number to a decent number is more important than going from a decent number to the best number).
I think the reason that frequentist statistics are popular is that they meet these three (misguided) desires. Frequentist statistics is full of numbers (like p-values) that can easily be mistaken for the probabilities that people desire. People don't actually care that a p-value of 0.05 does NOT mean that the null hypothesis has a 5% chance of being true. They just care that there's an easy objective number that justifies taking an action (like choosing one headline over another).
Because Bayesian probability is not intuitive given how we (as a society) currently approach statistics. If we start teaching Bayesian probability in grade school then maybe the next generation will "get it" sufficiently to make mainstream applications possible.
Assuming I'm a marketer who's using A/B testing tools and who's only interested in getting "statistically significant" results, and assuming Bayesian methods provide some advantage in terms of regret minimization that translates to real dollars earned, I feel that I could totally outsource my understanding of the theoretical underpinnings that get me that result. After all, what percentage of Optimizely or VWO users perfectly understands the statistical framework the tools are based on? So I'm not really convinced by that line of reasoning.
Bayesian methods don't perform so much better in the situation you just named. They're also more expensive.
They really shine in that they provide a uniform vocabulary for producing new, more sophisticated models, while "frequentist" methods usually rely on ingenuity to get to better models. So if you're actively exploring a model space and want to attach a bunch of assumptions and degrees of freedom to correspond with a theory you're testing... then Bayesianism is the way to go.
Theoretically, marketers are doing that exact process. In practice, they don't see it as a statistical process, though. Maybe someday a tool will bridge that gap successfully... but again, you're unlikely to get a huge advantage from Bayesian methods without that increased work investment.
I think it's more that the people running A/B tests are not knowledgeable in stats; there are many other pitfalls people fall into (in my personal experience) that are much worse than not using Bayes' rule, like sampling problems.
Yea, I really agree with that statement. There seem to be much bigger problems that most people (starting out or moderately advanced) have with A/B testing. I hate when I see articles that talk about getting amazing results in "just 2 days."
I'd argue that Bayesian probability is more intuitive to the non-statistically trained. Once people know frequentist statistics, it's more difficult to talk to them about Bayesian results. I think you're right that the future could be more Bayesian if tomorrow's statistical wizards weren't inculcated into a frequentist mindset.
You can read The Theory That Would Not Die for the gory details but the short of it is that various people objected to the notion that probabilities could be used to encode a degree of certainty; and that the existence of priors made the whole thing unscientific.