At my first webdev internship, my only job was to report to the "Head of Analytics" (a young liberal arts guy). All I did all day was make the tweaks he told me to do. It was stuff like "make this button red, green or blue", or "try these three different phrasings".
We got no more than 100 hits a day, with no more than 2-3 conversions a day, and he would run these tests for, like, 2 days.
I hated it, and the website looked horrible because every element was competing with every other, each just using whatever random color had won.
I've seen that, too. One of my clients redid their marketing site 3x in one year, each time claiming incredible improvements. The incredible improvements turned out to be local hill climbing, while the entire site's performance languished... 3-4 years ago there were a ton of blog posts about how a green button produced incredible sales when compared to a red button. And so everyone switched to green buttons...
By contrast, I've evolved multiple websites through incremental, globally measured optimizations. It's a lot of fun and it requires you to really understand your users (I've called A/B testing + analytics "a conversation between you and your users"). But, as you point out, it can be tough to get statistically significant data on changes to a small site. That's why I usually focused on big effects (e.g. 25%), rather than on the blog posts about "OMG! +2.76% change in sales!". That's also why I did a lot of "historical testing", under the assumption that week-to-week changes in normalized stats would be swamped by my tests.
under the assumption that week-to-week changes in normalized stats would be swamped by my tests
This is an enormously problematic assumption, which you can verify by either looking at the week-to-week stats for a period prior to you joining the company, or (for a far more fun demonstration) doing historical testing of the brand of toothpaste you use for the next 6 weeks. Swap from Colgate to $NAME_ANOTHER_BRAND, note the improvement, conclude that website visitors pay an awful lot of attention to the webmaster's toothpaste choices.
Full disclosure: I work for Qubit who published this white paper.
This kind of "historical testing" (I think people often call it sequential testing?) can be pretty dangerous even for large effects. For example Christmas might be a really good time to change the colour of all the buttons on your site and see a 50% increase in sales.
Yes. This kind of micro-A/B testing ("red or green buttons?") feels analogous to premature optimization when coding. Don't worry about the tiny 0.0001% improvements you get from using a for-loop over a while-loop; improve the algorithm itself for order-of-magnitude changes. Focus on the big picture.
There are a lot of published "case studies" in the internet marketing field that consist of a few hundred views and a handful of conversions. It is even more embarrassing considering you often need 100,000+ unique visitors and thousands of conversions to find real winners... and you still have to deal with the reality that a real 'winner' may result in a drop-off of sales (in lead generation), an increase in chargebacks (if your conversion was a sale), etc. This accounts for a sliver of the regression to the mean mentioned in the whitepaper.
Tests have value, but just making your site/app very simple and completely non-confusing to the viewer can do something years of split tests will not.
I suggest running tests and monitoring metrics as you implement design changes, not so much as a magic eight ball, but to ensure you avoid truly catastrophic UI fuck ups.
Full disclosure: I work for Qubit who published this white paper.
I see a lot of this kind of testing going on in the industry and it's frustrating. A/B testing can be a massive tool for your business if it's done right but obviously if you only wait for 2-3 conversions, you're not learning much... "Good" to hear that other people feel the same way!
Well, it depends on your conversion rate. But assuming it's ~a few percent, then yes, it will be hard to measure anything other than very, very large effects in conversion rate unless you're willing to wait quite a long time.
You can measure more upper-funnel things though like I said in the GP, which can be very helpful, especially in combination with qualitative feedback, although this depends on what exactly your business is...
I'm someone currently specializing in analytics as a digital marketer at work (and learning R and a bit of Python in my spare time for greater and swifter data analysis!) Similar to your former superior, I'm also coming out of a liberal arts background. I just want to make it clear that someone like me, despite their background, agrees with you that the person you were reporting to was foolish to even bother A/B testing such minor elements at 100 hits/day.
Sadly, many foolish "SEOs"/"digital marketers"/"growth hackers" have this same mentality that such subtle changes--despite low traffic--still offer meaningful information to digest and further analyze. But hey, they gotta keep their boss/clients on-board for the thrill and payment, right? For everyone out there, remember that outside the highest echelons of traffic, this testing is often performed by marketers with BAs in business administration, marketing, or liberal arts degrees like mine. They are often not the statisticians referenced in this document. And sadly, unlike me, they may be unwilling to stretch out into a programming language for data analysis and may never have cracked open a book on statistics. But frankly, they have other things to worry about--like staying on budget and the overall digital branding and marketing strategy. Their budget and time are likely better applied outside of A/B testing.
If you have a mathematics background, reach out to your marketing department. If you consider yourself a math-wiz, reach out to the "growth hacker" or "SEO" a few feet away. They deal with the stuff you don't want to deal with. You deal with the stuff they don't want to deal with. Help each other out and engage in a conversation to better help your business. At least your superiors would appreciate it.
Personally, when it comes to landing pages, I test much more dramatic shifts--significant changes to the entire design or to the header imagery along with the call-to-action. I don't buy into testing slight adjustments to things like font size or button color (especially when there is such low volume). That said, I've never worked with hundreds of thousands of visitors per month on a site, where smaller changes can make a bit more sense to test.
gkoberger, I'm sorry you hated your first webdev internship. I would have hated it too.
On a side note (making specific reference to the document instead of the comment!), I really enjoyed point #3. This speaks very much to the often short-lived A/B testing of low-volume AdWords text ads. The data is often ALL over the place despite the (otherwise) "professional" use of the platform.
There are also some complex problems with assumptions that are infrequently addressed, e.g. maybe a regular user who sees a structural/cosmetic change is more likely to look at it and click it, while that effect would fade away in steady state.
confidence in your own decisions can also be referred to as a Bayesian prior ;)
I've treated the A/B tests I've run pretty much as a case of Bayesian parameter estimation (where the true conversion of A and of B are your parameter). You then get nice beta distributions you can sample from, as well as use the prior to constrain expectations of improvement and also reduce the effects of early flukes in your sampling.
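A minimal sketch of that setup, with made-up counts and an assumed Beta prior (conjugate updating means the posterior stays a beta distribution):

```python
import numpy as np
from scipy import stats

# Hypothetical prior: past data suggests ~5% conversion, worth ~200 visits
prior_a, prior_b = 10, 190          # Beta(10, 190), mean = 5%

# Observed test data (made-up numbers)
conv_A, n_A = 60, 1000
conv_B, n_B = 75, 1000

# Conjugate update: prior + data gives another Beta distribution
post_A = stats.beta(prior_a + conv_A, prior_b + n_A - conv_A)
post_B = stats.beta(prior_a + conv_B, prior_b + n_B - conv_B)

# Sample both posteriors to estimate P(B beats A)
rng = np.random.default_rng(1)
samples_A = post_A.rvs(100_000, random_state=rng)
samples_B = post_B.rvs(100_000, random_state=rng)
print("P(B > A) ~", (samples_B > samples_A).mean())
```

The prior acts as a couple hundred "virtual visitors", which is exactly what damps the early flukes: a lucky first day can't drag the posterior far until real data outweighs it.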
Bayesian approaches are probably out of reach for most small companies. They have a long way to go before being as approachable and easy as frequentist approaches. Schools, and the statistics field as a whole, need drastic reformation of the introductory course offerings that are taken by everyone.
Until then, it's A/B, p value <.05, ignore bias and sample size for companies who aren't large enough to have a statistician or data scientist.
No they aren't. Here is a Bayesian method that is just as easy as any frequentist one. At my last job, a completely non-technical user who didn't even understand statistical significance used it just fine.
The only cost of the Bayesian method is that the bayesian python script is thousands of times slower than the frequentist one. I didn't do benchmarks, but in terms of order of magnitude, the frequentist method might take 1 microsecond while the Bayesian method might take 1 second.
 He used a less advanced version of the method which used a normal approximation - not that he needed to know the difference.
There are several things that help. Firstly you're not just looking for a red light/green light significance. Since you're actually modeling the beta distribution for each conversion rate you not only can ask "what's the probability that this test is an improvement?" you can actually sample from both distributions and see what that improvement looks like.
For example, I just simulated some bad data. A has 480 observations and a mean conversion of 33%; B has 410 observations and a mean conversion of 37%. The p-value here is 0.0323. In the traditional A/B testing model we'd be done, claiming better than a 10% improvement!
However when I sample from these 2 beta distributions I see that my credible region is -2% to 34% meaning this new test could be anywhere from 2% worse to 34% better. No magic value is needed to tell you that you really don't know anything yet.
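For anyone who wants to reproduce this, here is a sketch using counts inferred from the figures above (roughly 158/480 and 152/410) with a flat prior; the interval endpoints will wobble a bit between runs and won't match the quoted -2%/34% exactly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Approximate counts implied by the stated means (480 @ 33%, 410 @ 37%)
conv_A, n_A = 158, 480
conv_B, n_B = 152, 410

# Posterior Beta distributions under a flat Beta(1, 1) prior
samples_A = stats.beta(1 + conv_A, 1 + n_A - conv_A).rvs(100_000, random_state=rng)
samples_B = stats.beta(1 + conv_B, 1 + n_B - conv_B).rvs(100_000, random_state=rng)

# Distribution of the relative improvement of B over A
uplift = samples_B / samples_A - 1
lo, hi = np.percentile(uplift, [2.5, 97.5])
print(f"95% credible interval for uplift: {lo:+.0%} to {hi:+.0%}")
```

The point survives the wobble: the interval straddles zero while extending past +30%, so despite the "significant" p-value you really don't know much yet.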
Another huge help is the use of a prior. Until your data overrides your prior belief, you aren't going to see anything. Going with the last example, if I had a good prior that the true conversion rate on that page was actually 33%, I wouldn't even have gotten a p-value of less than 0.05. On the other hand, if I had a strong prior that the conversion rate was 50%, that would imply that both A and B were getting strangely unlucky results, which would actually boost the probability that B was in fact an improvement.
On the philosophical side, Bayesian statistics are simply trying to quantify what you know, not give you 'yes'/'no' answers. Maybe the gamble of -2 to 34 is good for you, or maybe you really want to know tighter bounds on your improvement and aren't comfortable with any possibility of decline. Bayesian statistics gives you a direct way to trade off certainty with time.
Full disclosure: I work for Qubit who published this white paper.
Just wanted to add that if you have less than a million users you can A/B test for upper funnel goals, effectively measuring if changes improve engagement. Obviously then you have the problem of working out if the engagement translates into more sales but perhaps you're willing to wait longer to find out if a test that improves engagement leads to more revenue in the long run.
For simple tests you can reverse the mathematics to get good estimates of how many observations are needed given a goal for your desired power and tolerance for false positives. Asking for greater power makes your test more sensitive ("buys a bigger telescope") at the cost of increased sample size. Asking for fewer false positives ("cleaning the lenses") costs similarly.
For more sophisticated tests, ones less likely to be seen in an A/B scenario, you might not be able to reverse the mathematics and get a direct answer, so often people will run simulation studies to guess at the needed sample size.
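For the simple two-proportion case, the reverse calculation is the standard normal-approximation formula (illustrative rates below; real planning tools add continuity corrections and other refinements):

```python
from scipy.stats import norm

def sample_size_per_arm(p_base, p_target, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided
    two-proportion z-test (normal approximation)."""
    z_a = norm.ppf(1 - alpha / 2)   # cleaning the lenses: fewer false positives
    z_b = norm.ppf(power)           # bigger telescope: more sensitivity
    var = p_base * (1 - p_base) + p_target * (1 - p_target)
    return (z_a + z_b) ** 2 * var / (p_base - p_target) ** 2

# e.g. detecting a lift from 5% to 6% conversion
n = sample_size_per_arm(0.05, 0.06)
print(round(n), "visitors per arm, roughly")
```

Raising either knob raises the bill: asking for 90% power instead of 80% adds roughly a third more visitors in this example.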
Nothing's "wrong" with 89.99% - 90% is simply a preferred value. The point of the exercise is not to pick a magic number, but to pick A number and therefore a target sample size before collecting results.
Without something like this, it's very easy to fall into the "we'll collect data until something that looks significant appears" trap.
In practice, doing this will lead to what the author talks about under "Stopping tests as soon as you see winning results will create false positives." It doesn't make your data invalid, but it will generally lead to poor methodology if you are not careful.
consider fair coin flip sequences. stopping at arbitrary points of your choice can never really affect anything serious. but you need to make sure you're doing math right, meaning if you stop when something gets a lead you don't use incorrect math that says the coin is biased.
data is data. as long as the method being used to collect individual data points is fine, and they are collected independently, then the data you get as a result is gonna be OK, the rest (like arbitrary stop time) doesn't ruin it. you just have to avoid bad math.
what ruins data is stuff like throwing 10% of the heads results in the trash or using other approaches in which data can be selectively discarded or not discarded. so just stopping arbitrarily can be a problem if you might never stop and throw out the results if you don't like them. but if you do something like "stop after 1 million data points max, or when i feel like it earlier" then your data is still OK because it cannot get selectively ignored.
stopping earlier cannot make a fair coin look unfair or anything like that.
this is not some random unknown position that flies in the face of how actual statistics works. something like this is the standard bayesian position, and i think it's true. (i strongly object to bayesian epistemology, but i think bayesian statistics is correct).
not ALL stopping rules are OK but lots are. you don't HAVE to use simple ones like "gather X data points, stop".
> And then there's the Bayesian reply: "Excuse you? The evidential impact of a fixed experimental method, producing the same data, depends on the researcher's private thoughts? And you have the nerve to accuse us of being 'too subjective'?"
Tests are calibrated on their false positive (alpha) and false negative rates (beta). If you have a lot of financial/upside/pain information then you can start to determine the relative pain of each of those kinds of failures and calibrate accordingly. At the end of the day the best choice is some complex function of the cost of false positives, the cost of false negatives, the cost of each new observation (which is probably non-linear), the upside of a discovery, and the prior likelihood of finding a discovery.
Which is to say you can definitely pick better choices for alpha, but it's really hard so everyone just picks whatever their field agrees is "OK". In science it's often 95%.
In my experience it can't be overstated how important it is to wait until you have a large sample size before deciding whether a variation is the winner. Nearly all of the A/B tests I run start out looking like a variation is the clear, landslide winner (sometimes showing 100%+ improvement over the original), only to eventually regress toward the mean. I can't get a clear idea of the winner of a test until I've shown the variation(s) to tens of thousands of visitors and received a few thousand conversions.

I've also learned that it's important to only perform tests on new visitors when possible. That means tests need to run longer to get the appropriate sample size. If you're testing over a few hundred conversions and performing tests on new and returning visitors, then you're probably getting skewed results.

Again, that's just my experience so far. YMMV. One thing to consider with a test is that the variations may be too subtle to have a significant, positive impact on conversion.
While the OP's article targets some low-hanging fruit, like halting criteria, multiple hypotheses, etc. which should be familiar to anyone serious about bioinformatics and statistics, Ioannidis takes these things a little farther and comes up with a number of corollaries that apply equally well to A/B testing.
After all, the randomized controlled trials that the FDA uses to approve new drugs are essentially identical to what would be called an A/B test on Hacker News.
This is awesome, thanks for the link! (And the visualizations help a ton, especially for the t-test... it's been a while since I took any stats courses and the terminology always puts me off a bit but the graphs make sense.)
thanks but what does "expected conversion rate" mean exactly? it isn't defined and I couldn't find that term anywhere else on the site.
ah, ok, got it. but why is their default expected conversion rate set so high? sheeesh
Putting aside bandits and all that, it seems like the first step should be to set up a hierarchical prior which performs shrinkage. Multiple-comparison and stopping issues are largely due to using frequentist tests rather than a simple probabilistic model with inference that conditions on the observed data.
> We know that that, occasionally, a test will generate a
> false positive due to random chance - we can’t avoid that.
> By convention we normally fix this probability at 5%. You
> might have heard this called the significance probability
> or p-value.
> If we use a p-value cutoff of 5% we also expect to see 5
> false positives.
Am I reading this incorrectly, or is the author describing p-values incorrectly?
A p-value is the chance a result at least as strong as the observed result would occur if the null hypothesis is true. You can't "fix" this probability at 5%. You can say "results with a p-value below 5% are good candidates for further testing". The fact that p-values of 0.05 and below are often considered significant in academia tells you nothing about the probability of a false positive occurring in an arbitrary test.
I don't follow. Why would we expect 5% of those 90 cases to be false positives, and what relationship does the estimate of 5% have to p-value? I don't understand how p-value could ever be used to predict the number of false positives one would expect to observe in a bundle of arbitrary tests.
> A p-value cutoff of 5% says that you have a 5% probability that you're wrong in rejecting the Null Hypothesis.
I don't think this is right. A p-value cutoff of 0.05 doesn't, by itself, indicate anything about the underlying probability of incorrectly rejecting the null hypothesis. It tells you that, in a test that meets your cutoff, if the null hypothesis is true, the chance of seeing a result as strong or stronger than the one observed is 5% or less. But that can't tell you the chance you're wrong in rejecting the null hypothesis.
A 1% chance of seeing results as strong as your results if the null hypothesis is true does not mean that there's a 99% chance of the null hypothesis being false.
Regardless, even putting this disagreement to one side, I still don't see how the original author's point makes sense. He or she seems to be using the cutoff as an indication of the underlying false-positive probability for prospective tests, regardless of the results of those tests meet the cutoff or not.
gabemart, you're right; a_bonobo, ronaldx, you guys are wrong. p-values are commonly misunderstood to mean that the result has a "5% chance of being wrong". That's not what a p-value is. Please go ahead and read the 'misunderstandings' section on p-values on Wikipedia.
Author of the paper here. You're right, this is incorrect. I corrected this in the final copy, but an earlier draft seems to have been put on the website. There are a few other errors too.
I am describing the 'significance level' here not the 'p-value', as you say.
"Let’s imagine we perform 100 tests on a website and, by running each test for 2 months, we have a large enough sample to achieve 80% power. 10 out of our 100 variants will be truly effective and we expect to detect 80%, or 8, of these true effects.
If we use a p-value cutoff of 5% we also expect to see 5 false positives. So, on average, we will see 8+5 = 13 winning results from 100 A/B tests."
If we expect 10 truly effective tests and 5 false positives, we'd have 15 tests that rejected the null hypothesis of h_0 = h_test. Taking power into account, shouldn't we see 15 × 0.8 = 12 winning results? I.e., wouldn't one of the false positives also have not-enough-power?
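A quick way to sanity-check the arithmetic. Power only applies to the 10 real effects; the 5% false-positive rate only applies to the 90 null tests, so neither count gets multiplied by the other:

```python
alpha, power = 0.05, 0.80
n_tests, n_real = 100, 10

true_pos = n_real * power                # 10 real effects, 80% detected = 8
false_pos = (n_tests - n_real) * alpha   # 90 null tests at a 5% cutoff  = 4.5
print(true_pos + false_pos)              # 12.5, close to the paper's 8 + 5 = 13
```

Beta (missed detections) never applies to a null test, since there is no real effect to miss, so none of the false positives is "lost to power". The paper's 5 comes from applying the 5% cutoff to all 100 tests, which overcounts very slightly.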
The "regression to the mean" and "novelty" effects are getting at two different things (both true, both important).
1. Underpowered tests are likely to exaggerate differences, since E(abs(truth - result)) increases as the sample size shrinks.
2. The much bigger problem I've seen a lot: when users see a new layout they aren't accustomed to they often respond better, but when they get used to it, they can begin responding worse than with the old design. Two ways to deal with this are long term testing (let people get used to it) and testing on new users. Or, embrace the novelty effect and just keep changing shit up to keep users guessing - this seems to be FB's solution.
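Point 1, sometimes called the "winner's curse", can be simulated directly (made-up rates: a true 10% lift tested with a deliberately small sample):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical truth: variant B is genuinely 10% better (5.0% -> 5.5%)
p_A, p_B, n = 0.050, 0.055, 2000      # deliberately underpowered
sims = 4000

conv_A = rng.binomial(n, p_A, sims) / n
conv_B = rng.binomial(n, p_B, sims) / n

# One-sided z-test per simulated experiment
pool = (conv_A + conv_B) / 2
se = np.sqrt(2 * pool * (1 - pool) / n)
significant = (conv_B - conv_A) / se > stats.norm.ppf(0.95)

lift = conv_B / conv_A - 1
print(f"power: {significant.mean():.0%}")
print(f"true lift: 10%; mean lift among 'winners': {lift[significant].mean():.0%}")
```

Only the lucky draws cross the significance threshold, so the experiments that "win" report a lift far above the true 10% even though nothing went wrong procedurally.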
What bothers me about A/B tests is when people say, eg."there was a 7% improvement" without telling us the sample size, or error margin. I'd rather hear: On a sample size of 1,000 unique visits, the improvement rate was 7% +/- 4%
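For anyone who wants the margin itself, a normal-approximation sketch (illustrative numbers; with realistic traffic the margin often dwarfs the headline lift, which is exactly why it should be reported):

```python
import math

# Hypothetical test: 1,000 visits per variant
n_A = n_B = 1000
p_A, p_B = 0.200, 0.214          # a 7% relative improvement

# Normal-approximation 95% margin of error on the absolute difference
se = math.sqrt(p_A * (1 - p_A) / n_A + p_B * (1 - p_B) / n_B)
margin_abs = 1.96 * se
margin_rel = margin_abs / p_A
print(f"improvement: +7% relative, +/- {margin_rel:.0%} (95% CI)")
```

With these numbers the margin comes out well above the 7% being claimed, so "7% improvement" on 1,000 visits is mostly noise.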
I really liked this; it's condescending, but in a good natured sort of way. It's as if the author was trying to explain really basic statistics to a marketer, then realized that the marketer had NO idea what he was talking about.
So you get statements like "This is a well-known phenomenon, called ‘regression to the mean’ by statisticians. Again, this is common knowledge among statisticians but does not seem to be more widely known."
Correct on the math, to the limit of my understanding of it and quick glance.
I am agnostic about whether most A/B testing practitioners administer their tests correctly -- of the universe of companies I've seen, far and away the most common error regarding A/B testing is "We don't A/B test.", which remains an error even after you read this article.
The novelty effect they talk about, which the article says is probably simple reversion to the mean, is -- in my opinion -- likely a true observation of the state of the world. You can watch your conversion-rate-over-time for many offers, many designs, many products, etc., and they often start out quite high and taper off, both in circumstances where there is obvious alternate causality and in circumstances where there isn't. By comparison, I have not often participated in tests where conversion rates started out abnormally low and reverted to the mean, which we'd expect exactly as often as "started out high" if that was indeed what we were seeing.
I believe so strongly in the novelty effect that I have written proposals to profitably exploit it by scalably manufacturing novelty. Sadly, none of them are public. It's on my to-do list for one of these months but a lot of things are on my to-do list for one of these months.
If you run many tests, which as time approaches infinity you darn better, your odds of seeing a false positive approach one. Contra the article, you gladly accept this as a cost of doing business, because you know to a statistical certainty that you've seen many, many more true positives.
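For independent tests that's just 1 - (1 - alpha)^k (illustrative, assuming a 5% cutoff):

```python
# Chance of at least one false positive across k independent null tests
alpha = 0.05
chance = {k: 1 - (1 - alpha) ** k for k in (1, 14, 50, 100)}
for k, p in chance.items():
    print(f"{k:3d} tests -> {p:.1%} chance of >= 1 false positive")
```

It crosses 50% at around 14 tests and is all but certain by 100, which is why you budget for false positives rather than trying to eliminate them.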
That about sums it up. If you have any particular questions, happy to answer them. My takeaway is "Good article. Please don't use it to justify a decision to not test."
The article is spot on. We at http://visualwebsiteoptimizer.com/ know that there are some biases (particularly related to 'multiple comparisons' and 'multiple looks at the data') that lead to results that seem better than they actually are. Though the current results are not wrong. They are directionally correct, and with most A/B tests, even if 95% confidence is really a true confidence of 90% or less, the business will still do better implementing the variation (v/s not doing anything).
Of course, these are very important issues for A/B testing vendors like us to understand and fix, since users mostly rely on our calculations to base their decisions. You will see us working towards taking care of such issues.
> They are directionally correct, and with most A/B tests even if 95% confidence is really a true confidence of 90% or less, the business will still do better implementing the variation (v/s not doing anything).
What? That's not right at all! A confidence measure is how much you can trust that there's actually a difference. You can't say it'll improve things if your confidence is lower than your original threshold!
In addition to this, every time you change something you:
1) Might introduce bugs
2) Spend money
3) Spend time you could be spending adding a new feature or getting a new customer
> What? That's not right at all! A confidence measure is how much you can trust that there's actually a difference. You can't say it'll improve things if your confidence is lower than your original threshold!
A 95% confidence doesn't magically translate into a binary decision of winner v/s no decision. A 90% confidence means that the variation is more likely to be better than control, but of course not as likely as if the confidence were 95%. The p-value is an arbitrary cut-off (a confidence of 94.5% shouldn't make you throw out your results). Of course, in fields such as clinical trials you'd want to be very sure of your results and might not want to take chances, but on the web, when you're running many tests, you are usually OK with something that is probable to work better than the existing version.
Of course, if it is a high stakes A/B test on the web, you'd be as careful as a clinical trial design. We're working towards making all those techniques available within the tool itself.
I'm afraid that's not quite right. A simple Python simulation will show you that a variant with -5% (i.e. NEGATIVE) uplift will still give a positive result around 10% of the time if you perform early stopping of the test.
To remove all doubt, your interpretation of the statistics is incorrect. In particular this sentence is demonstrably false: "They are directionally correct, [...] the business will still do better implementing the variation (v/s not doing anything)."
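A sketch of that simulation, with hypothetical rates (a 5% baseline and a variant that is truly 5% worse, peeking every 2,000 visitors per arm); the exact percentages vary by run, but peeking can only inflate the count relative to a single look at the same data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

p_A = 0.050
p_B = p_A * 0.95                      # variant truly has -5% relative uplift
sims, peeks, block = 400, 10, 2000    # peek every 2,000 visitors per arm

peek_wins = end_wins = 0
for _ in range(sims):
    # Cumulative conversions at each peek
    c_A = np.cumsum(rng.binomial(block, p_A, peeks))
    c_B = np.cumsum(rng.binomial(block, p_B, peeks))
    n = block * np.arange(1, peeks + 1)
    pool = (c_A + c_B) / (2 * n)
    se = np.sqrt(2 * pool * (1 - pool) / n)
    z = (c_B / n - c_A / n) / se
    p_vals = 1 - stats.norm.cdf(z)    # one-sided: "B is better"
    if (p_vals < 0.05).any():         # stop at the first 'significant' peek
        peek_wins += 1
    if p_vals[-1] < 0.05:             # fixed-horizon look at the end
        end_wins += 1

print(f"early stopping declares the WORSE variant a winner: {peek_wins / sims:.0%}")
print(f"a single look at the end does so:                   {end_wins / sims:.0%}")
```

Every extra peek is another chance for noise to cross the 5% line, so the early-stopping rate is necessarily at least as high as the fixed-horizon rate, and in practice several times higher.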
Related... someone should write a good article about estimating customer acquisition costs (CAC, or ROI if you prefer) based on conversion rates of ads.
It drives me batty when people tell me their "average" conversion rate is 1% after running a $25 ad campaign with so few clicks. It seems like too many folks are just oblivious to sample size, confidence intervals, and power calculations -- something that could be solved with a quick Wikipedia search.
Regarding the final bullet point about doing a second validation: the sample size should be bigger, right? Because winners tend to coincide with positive random effects, you should choose a larger experiment size and expect to see a smaller result.
Visibility on this is set to "Private" -- is it really supposed to be linked publicly on HN? I was about to tweet a link to it and then I felt dirty, like maybe the author wanted to send the link to just a select group.