We got no more than 100 hits a day, with no more than 2-3 conversions a day, and he would run these tests for, like, 2 days.
I hated it, and the website looked horrible because every element was competing with all the others and just used whatever random color happened to win.
By contrast, I've evolved multiple websites through incremental, globally measured optimizations. It's a lot of fun and it requires you to really understand your user (I've called A/B testing + analytics "a conversation between you and your users"). But, as you point out, it can be tough to get statistically significant data on changes to a small site. That's why I usually focused on big effects (e.g. 25%), rather than on the blog-post-style "OMG! +2.76% change in sales!". That's also why I did a lot of "historical testing", under the assumption that week-to-week changes in normalized stats would be swamped by my tests.
This is an enormously problematic assumption, which you can verify by either looking at the week-to-week stats for a period prior to you joining the company, or (for a far more fun demonstration) doing historical testing of the brand of toothpaste you use for the next 6 weeks. Swap from Colgate to $NAME_ANOTHER_BRAND, note the improvement, conclude that website visitors pay an awful lot of attention to the webmaster's toothpaste choices.
This kind of "historical testing" (before/after testing is probably the more standard name) can be pretty dangerous even for large effects. For example Christmas might be a really good time to change the colour of all the buttons on your site and see a 50% increase in sales.
Tests have value, but just making your site/app very simple and completely non-confusing to the viewer can do something years of split tests will not.
I suggest running tests and monitoring metrics as you implement design changes, not so much as a magic eight ball, but to ensure you avoid truly catastrophic UI fuck ups.
I see a lot of this kind of testing going on in the industry and it's frustrating. A/B testing can be a massive tool for your business if it's done right but obviously if you only wait for 2-3 conversions, you're not learning much... "Good" to hear that other people feel the same way!
You can measure more upper-funnel things though like I said in the GP, which can be very helpful, especially in combination with qualitative feedback, although this depends on what exactly your business is...
Sadly, many foolish "SEOs"/"digital marketers"/"growth hackers" have this same mentality that such subtle changes--despite low traffic--still offer meaningful information to digest and analyze further. But hey, they gotta keep their bosses/clients on board for the thrill and the payment, right? For everyone out there, remember that outside the highest echelon of traffic levels, this testing is often performed by marketers with BAs in business administration, marketing, or liberal arts degrees like me. They are often not the statisticians referenced in this document. And sadly they may well be people unlike me, unwilling to stretch into a programming language for data analysis, who may never have cracked open a book on statistics. But frankly they have other things to worry about--like staying within your budget and your overall digital branding and marketing strategy. Their budget and time are likely better applied outside of A/B testing.
If you have a mathematics background, reach out to your marketing department. If you consider yourself a math-wiz, reach out to the "growth hacker" or "SEO" a few feet away. They deal with the stuff you don't want to deal with. You deal with the stuff they don't want to deal with. Help each other out and engage in a conversation to better help your business. At least your superiors would appreciate it.
Personally, when it comes to landing pages, I test much more dramatic shifts--significant changes to the entire design, or to the header imagery along with the call-to-action. I don't buy into testing slight adjustments to things like font size or button color (especially at such low volume). That said, I've never worked with hundreds of thousands of visitors per month on a site, where one could imagine smaller changes making a bit more sense to test.
gkoberger, I'm sorry you hated your first webdev internship. I would have hated it too.
On a side note (making specific reference to the document instead of the comment!), I really enjoyed point #3. This speaks very much to the often short-lived A/B testing of low-volume AdWords text ads. The data is often ALL over the place despite the (otherwise) "professional" use of the platform.
I can't imagine how A/B tests are a productive use of time for any site with less than a million users.
There are so many more useful things you could be doing to create value. If you're running a startup you should rather have some confidence in your own decisions.
And then, yeah, I'm sure a lot of "successful" A/B tests will get washed out.
I've treated the A/B tests I've run pretty much as a case of Bayesian parameter estimation (where the true conversion of A and of B are your parameter). You then get nice beta distributions you can sample from, as well as use the prior to constrain expectations of improvement and also reduce the effects of early flukes in your sampling.
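Concretely, that workflow is just a conjugate Beta-Binomial update. A minimal sketch in Python (assuming numpy/scipy; the prior and the counts here are made up purely for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    # Illustrative prior: "conversion is probably around 3%",
    # worth about 100 visitors of evidence.
    prior_a, prior_b = 3, 97

    # Hypothetical observed counts: conversions and visitors per variant.
    conv_a, n_a = 30, 1000
    conv_b, n_b = 42, 1000

    # Conjugate update: the posterior over each true conversion rate is again a Beta.
    post_a = stats.beta(prior_a + conv_a, prior_b + n_a - conv_a)
    post_b = stats.beta(prior_a + conv_b, prior_b + n_b - conv_b)

    # Sample both posteriors to estimate P(B beats A) and a credible interval on the lift.
    sa = post_a.rvs(100_000, random_state=rng)
    sb = post_b.rvs(100_000, random_state=rng)
    lift = sb / sa - 1

    print(f"P(B > A) = {np.mean(sb > sa):.3f}")
    print(f"95% credible interval for lift: "
          f"[{np.percentile(lift, 2.5):+.1%}, {np.percentile(lift, 97.5):+.1%}]")

Until the data outweighs the prior pseudo-counts, the two posteriors sit close together, which is exactly what damps the early flukes.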
Until then, it's A/B testing with p-value < .05, ignoring bias and sample size, for companies that aren't large enough to have a statistician or data scientist.
The only cost of the Bayesian method is that the Bayesian Python script is thousands of times slower than the frequentist one. I didn't do benchmarks, but in terms of order of magnitude, the frequentist method might take 1 microsecond while the Bayesian method might take 1 second.
He used a less advanced version of the method, which used a normal approximation - not that he needed to know the difference.
Sometimes bad scenarios will get good results, by luck, and sometimes good scenarios will get bad results, by luck.
Using more advanced statistical methods doesn't change that these cases are fundamentally indistinguishable.
If the differences are drastic enough you can still get value from split testing. Incremental changes are just probably not going to bring you much luck.
For example I just simulated some bad data. A has 480 observations and a mean conversion of 33%; B has 410 observations and a mean conversion of 37%. The p-value here is 0.0323. In the traditional A/B testing model we'd be done and claiming better than a 10% improvement!
However, when I sample from these 2 beta distributions I see that my credible region is -2% to 34%, meaning this new variant could be anywhere from 2% worse to 34% better. No magic cutoff value is needed to tell you that you really don't know anything yet.
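For anyone who wants to poke at this, here's a sketch that redoes the interval with counts matching those means (I don't have the exact simulated data, so the endpoints come out slightly different from -2%/34%, but the conclusion is the same):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Counts consistent with the stated means (the exact simulated data isn't given).
    conv_a, n_a = 158, 480   # ~33% conversion
    conv_b, n_b = 152, 410   # ~37% conversion

    # Posterior over each true rate under a flat Beta(1, 1) prior.
    sa = stats.beta(1 + conv_a, 1 + n_a - conv_a).rvs(200_000, random_state=rng)
    sb = stats.beta(1 + conv_b, 1 + n_b - conv_b).rvs(200_000, random_state=rng)

    # Credible interval on B's relative improvement over A.
    lo, hi = np.percentile(sb / sa - 1, [2.5, 97.5])
    print(f"95% credible interval for lift: [{lo:+.1%}, {hi:+.1%}]")

Even though the point estimate looks like a solid win, the interval makes it obvious how little those ~900 observations have actually pinned down.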
Another huge help is the use of a prior. Until your data overrides your prior belief, you aren't going to see anything. Going with the last example, if I had a good prior that the true conversion rate on that page was actually 33%, I wouldn't even have gotten a p-value of less than 0.05. On the other hand, if I had a strong prior that the conversion rate was 50%, that would imply that both A and B were getting strangely unlucky results, which would actually boost the probability that B was in fact an improvement.
On the philosophical side, Bayesian statistics are simply trying to quantify what you know, not give you 'yes'/'no' answers. Maybe the gamble of -2 to 34 is good for you, or maybe you really want to know tighter bounds on your improvement and aren't comfortable with any possibility of decline. Bayesian statistics gives you a direct way to trade off certainty with time.
Just wanted to add that if you have less than a million users you can A/B test for upper funnel goals, effectively measuring if changes improve engagement. Obviously then you have the problem of working out if the engagement translates into more sales but perhaps you're willing to wait longer to find out if a test that improves engagement leads to more revenue in the long run.
For more sophisticated tests, ones less likely to be seen in an A/B scenario, you might not be able to reverse the mathematics and get a direct answer, so often people will run simulation studies to guess at the needed sample size.
I wrote something about it here:
Without something like this, it's very easy to fall into the "we'll collect data until something that looks significant appears" trap.
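A toy version of such a simulation study, assuming a two-proportion comparison via a chi-squared test (the baseline rate, lift, and traffic levels are all made up, just to show the mechanics):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    def simulated_power(p_control, p_variant, n_per_arm, alpha=0.05, runs=2000):
        """Estimate power by repeatedly simulating the experiment and testing it."""
        hits = 0
        for _ in range(runs):
            a = rng.binomial(n_per_arm, p_control)
            b = rng.binomial(n_per_arm, p_variant)
            # Chi-squared test on the 2x2 table of conversions vs non-conversions.
            table = [[a, n_per_arm - a], [b, n_per_arm - b]]
            _, p, _, _ = stats.chi2_contingency(table)
            hits += p < alpha
        return hits / runs

    # Made-up scenario: 2% baseline, hoping to detect a 25% relative lift.
    for n in (5_000, 10_000, 20_000, 40_000):
        print(f"n = {n:>6} per arm: power ~ {simulated_power(0.02, 0.025, n):.2f}")

You sweep n upward until the simulated power clears your target (say 80%); for anything fancier than two proportions, swap in the real test and the real data-generating process.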
I'd love to hear some clarification on how "[making] sure your math acknowledges" stopping randomized tests early works, because it flies in the face of how actual statistics works.
The terms you're looking for are 'sequential testing', 'optimal stopping', and 'adaptive trials'.
data is data. as long as the method being used to collect individual data points is fine, and they are collected independently, then the data you get as a result is gonna be OK, the rest (like arbitrary stop time) doesn't ruin it. you just have to avoid bad math.
what ruins data is stuff like throwing 10% of the heads results in the trash or using other approaches in which data can be selectively discarded or not discarded. so just stopping arbitrarily can be a problem if you might never stop and throw out the results if you don't like them. but if you do something like "stop after 1 million data points max, or when i feel like it earlier" then your data is still OK because it cannot get selectively ignored.
stopping earlier cannot make a fair coin look unfair or anything like that.
this is not some random unknown position that flies in the face of how actual statistics works. something like this is the standard bayesian position, and i think it's true. (i strongly object to bayesian epistemology, but i think bayesian statistics is correct).
not ALL stopping rules are OK but lots are. you don't HAVE to use simple ones like "gather X data points, stop".
see stuff like:
> And then there's the Bayesian reply: "Excuse you? The evidential impact of a fixed experimental method, producing the same data, depends on the researcher's private thoughts? And you have the nerve to accuse us of being 'too subjective'?"
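To make the "bad math" concrete: here's a quick simulation (assuming two identical variants and a stock chi-squared test; batch size and number of peeks are arbitrary) of what repeatedly reusing a fixed-sample p-value as a stopping rule does to the false positive rate:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)

    def peeking_false_positive_rate(p_true=0.05, batch=200, max_batches=50, runs=500):
        """Both variants are identical; count how often 'peek after every batch,
        stop at the first p < .05' declares a winner anyway."""
        false_positives = 0
        for _ in range(runs):
            a = b = n = 0
            for _ in range(max_batches):
                a += rng.binomial(batch, p_true)
                b += rng.binomial(batch, p_true)
                n += batch
                table = [[a, n - a], [b, n - b]]
                _, p, _, _ = stats.chi2_contingency(table)
                if p < 0.05:          # the "bad math": a fixed-sample test, reused
                    false_positives += 1
                    break
        return false_positives / runs

    print(f"false positive rate with peeking: {peeking_false_positive_rate():.2f}")

With 50 peeks the error rate lands far above the nominal 5%. The data itself is fine; the fixed-sample p-value is simply the wrong math for a data-dependent stopping time, which is what sequential and Bayesian procedures account for.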
FTA: "Bayesian experiment design: With Bayesian experiment design you can stop your experiment at any time and make perfectly valid inferences."
Which is to say you can definitely pick better choices for alpha, but it's really hard, so everyone just picks whatever their field agrees is "OK". In science that's usually alpha = 5%, i.e. 95% confidence.
While the OP's article targets some low-hanging fruit, like halting criteria, multiple hypotheses, etc. which should be familiar to anyone serious about bioinformatics and statistics, Ioannidis takes these things a little farther and comes up with a number of corollaries that apply equally well to A/B testing.
After all, the randomized controlled trials that the FDA uses to approve new drugs are essentially identical to what would be called an A/B test on Hacker News.
Use them to really know whether a conversion rate is significantly different, whether the mean values of two groups are significantly different, and how to calculate sample size:
most people have conversion rates between 1% and 3%
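For context on what that means for sample sizes, here's the standard two-proportion back-of-envelope (normal approximation; the baseline and lift are illustrative):

    from math import ceil, sqrt
    from scipy.stats import norm

    def n_per_arm(p_base, relative_lift, alpha=0.05, power=0.8):
        """Standard two-proportion sample size (normal approximation)."""
        p2 = p_base * (1 + relative_lift)
        z_a = norm.ppf(1 - alpha / 2)
        z_b = norm.ppf(power)
        variance = p_base * (1 - p_base) + p2 * (1 - p2)
        return ceil((z_a + z_b) ** 2 * variance / (p_base - p2) ** 2)

    # At a 2% baseline, even a hefty 25% relative lift needs serious traffic:
    print(n_per_arm(0.02, 0.25))   # ~13,800 visitors per arm

At 100 hits a day, that's the better part of a year for a single test.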
Gelman et al, "Why we (usually) don't have to worry about multiple comparisons" http://arxiv.org/abs/0907.2478
> We know that, occasionally, a test will generate a false positive due to random chance - we can’t avoid that. By convention we normally fix this probability at 5%. You might have heard this called the significance probability or p-value.

> If we use a p-value cutoff of 5% we also expect to see 5 false positives.
A p-value is the chance a result at least as strong as the observed result would occur if the null hypothesis is true. You can't "fix" this probability at 5%. You can say "results with a p-value below 5% are good candidates for further testing". The fact that p-values of 0.05 and below are often considered significant in academia tells you nothing about the probability of a false positive occurring in an arbitrary test.
In his described scenario there are 90 cases where the null hypothesis is true (he states as a premise: "10 out of our 100 variants will be truly effective").
So strictly, we expect to see 5% of 90 = an average of 4.5 false positives (he says 5 false positives).
[Edited to add: False positive rate is measured as a conditional probability https://en.wikipedia.org/wiki/False_positive#False_positive_...]
So if you test 100 times, you'd expect to wrongly reject the Null Hypothesis 5 times.
I don't think this is right. A p-value cutoff of 0.05 doesn't, by itself, indicate anything about the underlying probability of incorrectly rejecting the null hypothesis. It tells you that, in a test that meets your cutoff, if the null hypothesis is true, the chance of seeing a result as strong as or stronger than the one observed is 5% or less. But that can't tell you the chance you're wrong in rejecting the null hypothesis.
A 1% chance of seeing results as strong as your results if the null hypothesis is true does not mean that there's a 99% chance of the null hypothesis being false.
Regardless, even putting this disagreement to one side, I still don't see how the original author's point makes sense. He or she seems to be using the cutoff as an indication of the underlying false-positive probability for prospective tests, regardless of whether the results of those tests meet the cutoff or not.
"Let’s imagine we perform 100 tests on a website and, by running each test for 2 months, we have a large enough sample to achieve 80% power. 10 out of our 100 variants will be truly effective and we expect to detect 80%, or 8, of these true effects.
If we use a p-value cutoff of 5% we also expect to see 5 false positives. So, on average, we will see 8+5 = 13 winning results from 100 A/B tests."
If we expect 10 truly effective tests and 5 false positives, we'd have 15 tests that rejected the null hypothesis of h_0 = h_test. Taking power into account, shouldn't we see 15 * 0.8 = 12 winning results? I.e. wouldn't the false positives also be subject to not-enough-power?
Maybe the confusion here is in tests which have a "true" effect and an "observed" effect. If an experiment has a true effect, then you have some chance to observe it, which is the power.
But false positives have by definition already been observed as winners (that's what false positives are), so there's no need to apply the factor of 0.8 to them.
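Spelling the arithmetic out (a quick sketch; power only ever applies to the 10 true effects, while alpha applies to the 90 nulls):

    tests = 100
    true_effects = 10            # premise from the article
    power = 0.8                  # chance of detecting a real effect
    alpha = 0.05                 # false positive rate among the 90 nulls

    true_positives = true_effects * power                  # 8.0
    false_positives = (tests - true_effects) * alpha       # 4.5 (the article rounds to 5)
    winners = true_positives + false_positives             # 12.5

    # Of the "winning" tests, roughly a third are false discoveries:
    print(winners, false_positives / winners)              # 12.5 0.36

So of the ~12.5 expected "winners", roughly a third are false discoveries -- and no power factor ever touches the nulls.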
1. Underpowered tests are likely to exaggerate differences, since E(abs(truth - result)) grows as the sample size shrinks (see the sketch after this list).
2. The much bigger problem I've seen a lot: when users see a new layout they aren't accustomed to they often respond better, but when they get used to it, they can begin responding worse than with the old design. Two ways to deal with this are long term testing (let people get used to it) and testing on new users. Or, embrace the novelty effect and just keep changing shit up to keep users guessing - this seems to be FB's solution.
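On point 1, a quick simulation of the exaggeration effect (the rates are made up: a true +10% relative lift, tested with far too little traffic):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)

    # True state of the world: a real but small lift, and an underpowered test.
    p_a, p_b, n = 0.030, 0.033, 2_000     # true relative lift: +10%

    observed_lifts = []
    for _ in range(10_000):
        a = rng.binomial(n, p_a)
        b = rng.binomial(n, p_b)
        _, p, _, _ = stats.chi2_contingency([[a, n - a], [b, n - b]])
        if p < 0.05:                       # keep only the "winning" experiments
            observed_lifts.append(b / a - 1)

    print(f"fraction significant: {len(observed_lifts) / 10_000:.2f}")
    print(f"mean lift among winners: {np.mean(observed_lifts):+.0%} (true lift: +10%)")

Only the runs that got lucky clear the significance bar, so the surviving estimates run several times the true effect (what Gelman calls a Type M error).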
What bothers me about A/B tests is when people say, e.g., "there was a 7% improvement" without telling us the sample size or error margin. I'd rather hear: on a sample size of 1,000 unique visits, the improvement was 7% +/- 4%.
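The margin is cheap to compute, too - a normal-approximation sketch with illustrative counts (500 visits per variant out of the 1,000):

    from math import sqrt

    # Illustrative counts: 500 visits per variant out of the 1,000 total.
    conv_a, n_a = 50, 500      # 10% baseline
    conv_b, n_b = 54, 500      # 10.8% variant

    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

    lift = (p_b - p_a) / p_a
    margin = 1.96 * se / p_a          # rough 95% margin on the relative lift
    print(f"improvement: {lift:+.0%} +/- {margin:.0%}")

At that traffic the margin typically dwarfs the lift -- which is exactly why it should be quoted.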
So you get statements like "This is a well-known phenomenon, called ‘regression to the mean’ by statisticians. Again, this is common knowledge among statisticians but does not seem to be more widely known."
I thought that was hilarious.
I am agnostic about whether most A/B testing practitioners administer their tests correctly -- of the universe of companies I've seen, far and away the most common error regarding A/B testing is "We don't A/B test.", which remains an error even after you read this article.
The novelty effect they talk about, which the article says is probably simple reversion to the mean, is -- in my opinion -- likely a true observation of the state of the world. You can watch your conversion-rate-over-time for many offers, many designs, many products, etc., and they often start out quite high and taper off, both in circumstances where there is obvious alternate causality and in circumstances where there isn't. By comparison, I have not often participated in tests where conversion rates started out abnormally low and reverted up to the mean, which we'd expect exactly as often as "started out high" if reversion to the mean were indeed what we were seeing.
I believe so strongly in the novelty effect that I have written proposals to profitably exploit it by scalably manufacturing novelty. Sadly, none of them are public. It's on my to-do list for one of these months but a lot of things are on my to-do list for one of these months.
If you run many tests, which as time approaches infinity you darn better, your odds of seeing a false positive approach one. Contra the article, you gladly accept this as a cost of doing business, because you know to a statistical certainty that you've seen many, many more true positives.
That about sums it up. If you have any particular questions, happy to answer them. My takeaway is "Good article. Please don't use it to justify a decision to not test."
Of course, these are very important issues for A/B testing vendors like us to understand and fix, since users mostly rely on our calculations to base their decisions on. You will see us working towards taking care of such issues.
What? That's not right at all! A confidence measure is how much you can trust that there's actually a difference. You can't say it'll improve things if your confidence is lower than your original threshold!
In addition to this, every time you change something you:
1) Might introduce bugs
2) Spend money
3) Spend time you could be spending adding a new feature or getting a new customer
A 95% confidence doesn't magically translate into a binary decision of winner vs. no decision. A 90% confidence means that the variation is more likely to be better than control -- just not as likely as if confidence were 95%. The cutoff is arbitrary. (A confidence of 94.5% shouldn't make you throw away your results.) Of course, in fields such as clinical trials, you'd want to be very sure of your results and might not want to take chances, but on the web, when you're running many tests, you are usually OK with something that is probably better than the existing version.
Of course, if it is a high stakes A/B test on the web, you'd be as careful as a clinical trial design. We're working towards making all those techniques available within the tool itself.
It drives me batty when people tell me their "average" conversion rate is 1% after running a $25 ad campaign with so few clicks. It seems like too many folks are just oblivious to sample size, confidence intervals, and power calculations -- something that could be solved with a quick Wikipedia search.