Most Winning A/B Test Results are Illusory [pdf] (qubitproducts.com)
158 points by ernopp on Feb 23, 2014 | hide | past | favorite | 83 comments

At my first webdev internship, my only job was to report to the "Head of Analytics" (a young liberal arts guy). All I did all day was make the tweaks he told me to do. It was stuff like "make this button red, green or blue", or "try these three different phrasings".

We got no more than 100 hits a day, with no more than 2-3 conversions a day, and he would run these tests for, like, 2 days.

I hated it, and the website looked horrible because everything was competing with each other and just used whatever random color won.

I've seen that, too. One of my clients redid their marketing site 3x in one year, each time claiming incredible improvements. The incredible improvements turned out to be local hill climbing, while the entire site's performance languished... 3-4 years ago there were a ton of blog posts about how a green button produced incredible sales when compared to a red button. And so everyone switched to green buttons...

By contrast, I've evolved multiple websites through incremental, globally measured, optimizations. It's a lot of fun and it requires you to really understand your user (I've called AB testing+analytics "a conversation between you and your users"). But, as you point out, it can be tough to get statistically relevant data on changes to a small site. That's why I usually focused on big effects (e.g. 25%), rather than on the blog posts about "OMG! +2.76% change in sales!". That's also why I did a lot of "historical testing", under the assumption that week-week changes in normalized stats would be swamped by my tests.

> under the assumption that week-week changes in normalized stats would be swamped by my tests

This is an enormously problematic assumption, which you can verify by either looking at the week-to-week stats for a period prior to you joining the company, or (for a far more fun demonstration) doing historical testing of the brand of toothpaste you use for the next 6 weeks. Swap from Colgate to $NAME_ANOTHER_BRAND, note the improvement, conclude that website visitors pay an awful lot of attention to the webmaster's toothpaste choices.

Full disclosure: I work for Qubit who published this white paper.

This kind of "historical testing" (I think people often call it sequential testing?) can be pretty dangerous even for large effects. For example Christmas might be a really good time to change the colour of all the buttons on your site and see a 50% increase in sales.

Yes. This kind of micro-A/B testing ("red or green buttons?") feels analogous to premature optimization when coding. Don't worry about the tiny 0.0001% improvements you get from using a for-loop over a while-loop; improve the algorithm itself for order-of-magnitude changes. Focus on the big picture.

Can you expand? does globally measured optimisations mean the whole site saw a 1% rise after we did x? why is that different to a/B testing?

There are a lot of published "case studies" in the internet marketing field that consist of a few hundred views and a handful of conversions. It is even more embarrassing considering you often need 100,000+ unique visitors and thousands of conversions to find real winners... and you still have to deal with the reality that a real 'winner' may result in a drop-off of sales (in lead generation), an increase in chargebacks (if your conversion was a sale), etc. This accounts for a sliver of the regression to the mean mentioned in the whitepaper.

Tests have value, but just making your site/app very simple and completely non-confusing to the viewer can do something years of split tests will not.

I suggest running tests and monitoring metrics as you implement design changes, not so much as a magic eight ball, but to ensure you avoid truly catastrophic UI fuck ups.

Full disclosure: I work for Qubit who published this white paper.

I see a lot of this kind of testing going on in the industry and it's frustrating. A/B testing can be a massive tool for your business if it's done right but obviously if you only wait for 2-3 conversions, you're not learning much... "Good" to hear that other people feel the same way!

So if my site has 10k hits/month and I don't have a year, am I left to "stick with my gut" and qualitative user tests?

You can still run successful tests. Here's a handy calculator you can use to approximate a required sample size for your test: http://www.evanmiller.org/ab-testing/sample-size.html
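For anyone curious what such a calculator computes under the hood, here is a rough sketch of the standard two-proportion sample-size formula (my own code, not Evan Miller's; the 0.05 alpha and 0.80 power defaults are just the usual conventions):

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_base, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH arm to detect the given lift."""
    p_var = p_base * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    delta = p_var - p_base
    return math.ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)

sample_size_per_arm(0.03, 0.5)   # about 2,500 visitors per arm
```

Even to detect a huge 50% relative lift on a 3% baseline, you need thousands of visitors per arm, which is the point several commenters make about 100-hits/day sites.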

Well it depends on your conversion rate. But assuming it's ~few percent, then yes it will be hard to measure anything other than very very large effects in conversion rate unless you're willing to wait quite a long time.

You can measure more upper-funnel things though like I said in the GP, which can be very helpful, especially in combination with qualitative feedback, although this depends on what exactly your business is...

I'm someone currently specializing in analytics as a digital marketer at work (and learning R and a bit of Python in my spare time for greater and swifter data analysis!). Similar to your former superior, I'm also coming out of a liberal arts background. I just want to make it clear that someone like me, despite that background, agrees with you that the person you were reporting to was foolish to even bother A/B testing such minor elements at 100 hits/day.

Sadly, many foolish "SEOs"/"digital marketers"/"growth hackers" have this same mentality that such subtle changes--despite low traffic--still offer meaningful information to digest and further analyze. But hey, they gotta keep their boss/clients on board for the thrill and payment, right? For everyone out there, remember that outside the highest echelon of traffic levels, this testing is often performed by marketers with BAs in business administration, marketing, or liberal arts degrees like me. They are often not the statisticians referenced in this document. And sadly, they may well be people unlike me, unwilling to stretch out into a programming language for data analysis, who may never have cracked open a book on statistics. But frankly, they have other things to worry about--like staying within budget and overall digital branding and marketing strategy. Their budget and time are likely better applied outside of A/B testing.

If you have a mathematics background, reach out to your marketing department. If you consider yourself a math-wiz, reach out to the "growth hacker" or "SEO" a few feet away. They deal with the stuff you don't want to deal with. You deal with the stuff they don't want to deal with. Help each other out and engage in a conversation to better help your business. At least your superiors would appreciate it.

Personally, when it comes to landing pages, I test much more dramatic shifts--significant changes to the entire design or to the header imagery along with the call-to-action. I don't buy into testing slight adjustments to things like font size or button color (especially when there is such low volume). That said, I've never worked with hundreds of thousands of visitors per month on a site, where one could imagine smaller changes making a bit more sense to test.

gkoberger, I'm sorry you hated your first webdev internship. I would have hated it too.

On a side note (making specific reference to the document instead of the comment!), I really enjoyed point #3. This speaks very much to the often short-lived A/B testing of low-volume AdWords text ads. The data is often ALL over the place despite the (otherwise) "professional" use of the platform.

Ahh glorious...

I love the concept of A/A testing here, illustrating that you get apparent results even when you compare something to itself.
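The A/A idea is easy to try yourself. A sketch (my own, with made-up traffic numbers): split identical traffic into two arms many times and count how often a naive two-proportion z-test declares a "winner" even though nothing changed.

```python
import random
from statistics import NormalDist

random.seed(42)

def conversions(n, p):
    """Simulate n visitors converting independently at rate p."""
    return sum(random.random() < p for _ in range(n))

def p_value(c_a, c_b, n):
    """Two-sided two-proportion z-test with pooled variance."""
    pooled = (c_a + c_b) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    z = abs(c_a / n - c_b / n) / se
    return 2 * (1 - NormalDist().cdf(z))

n, p, trials = 1000, 0.30, 1000
winners = sum(p_value(conversions(n, p), conversions(n, p), n) < 0.05
              for _ in range(trials))
print(winners / trials)   # close to 0.05: ~5% illusory "winners"
```

With a 5% cutoff you get roughly 5% apparent winners from identical variants, which is exactly what the A/A test makes visible.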

I can't imagine how A/B tests are a productive use of time for any site with less than a million users.

There are so many more useful things you could be doing to create value. If you're running a startup you should rather have some confidence in your own decisions.

When ExP was a thing at Microsoft, we always ran an A/A test before we did experiments. We'd also do an A/A/B test to make sure the actual experiments were working.


There are also some complex problems with assumptions that are infrequently addressed, e.g. maybe if a regular user sees a structural/cosmetic change he is more likely to look at that and click it, while that would fade away in steady-state.

True. We did everything we could to account for that, though. Someone chooses to clear cookies every time they load a page - not much we could do.

A/A testing should be used to get accurate estimates for within-sample variance. If you run an A/A/B test then you can calibrate the A/B component to be sensitive w.r.t. the tolerances of real data.

And then yeah, I'm sure a lot of successful A/B tests will get washed.

confidence in your own decisions can also be referred to as a Bayesian prior ;)

I've treated the A/B tests I've run pretty much as a case of Bayesian parameter estimation (where the true conversion of A and of B are your parameter). You then get nice beta distributions you can sample from, as well as use the prior to constrain expectations of improvement and also reduce the effects of early flukes in your sampling.
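A minimal sketch of that setup, assuming conjugate Beta-Binomial updating and made-up conversion counts:

```python
import random

random.seed(7)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior_succ=1, prior_fail=1,
                   draws=20000):
    """P(true rate of B > true rate of A) by sampling Beta posteriors."""
    wins = 0
    for _ in range(draws):
        rate_a = random.betavariate(prior_succ + conv_a,
                                    prior_fail + n_a - conv_a)
        rate_b = random.betavariate(prior_succ + conv_b,
                                    prior_fail + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# An early fluke: 5/50 conversions on A vs 9/50 on B. A flat Beta(1, 1)
# prior makes B look like a probable winner; a prior concentrated near a
# plausible 10% rate (Beta(10, 90)) tempers that enthusiasm.
flat = prob_b_beats_a(5, 50, 9, 50)
informed = prob_b_beats_a(5, 50, 9, 50, prior_succ=10, prior_fail=90)
```

The informed prior pulls the estimate back toward the realistic rate, which is the "reduce the effects of early flukes" behaviour described above.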

Bayesian approaches are probably out of grasp for most small companies. They have a long way to go before being as approachable and easy as frequentist approaches. Schools and the statistics field as a whole need drastic reformation in introductory course offerings that are taken by everyone.

Until then, it's A/B, p value <.05, ignore bias and sample size for companies who aren't large enough to have a statistician or data scientist.

No they aren't. Here is a Bayesian method that is just as easy as any Frequentist one. At my last job, a completely non-technical user who didn't even understand statistical significance used it just fine [1].


The only cost of the Bayesian method is that the Bayesian Python script is thousands of times slower than the frequentist one. I didn't do benchmarks, but in terms of order of magnitude, the frequentist method might take 1 microsecond while the Bayesian method might take 1 second.

[1] He used a less advanced version of the method which used a normal approximation - not that he needed to know the difference.

Sorry, but I don't understand how Bayesian statistics could possibly solve the problems described here.

Sometimes bad scenarios will get good results, by luck, and sometimes good scenarios will get bad results, by luck.

Using more advanced statistical methods doesn't change that these cases are fundamentally indistinguishable.

You're right. The one exception though is with Bayesian statistics you can estimate an effect size using your experiment results using a credibility interval.

If the differences are drastic enough you can still get value from split testing. Incremental changes are just probably not going to bring you much luck.

There are several things that help. Firstly you're not just looking for a red light/green light significance. Since you're actually modeling the beta distribution for each conversion rate you not only can ask "what's the probability that this test is an improvement?" you can actually sample from both distributions and see what that improvement looks like.

For example, I just simulated some bad data. A has 480 observations and a mean conversion of 33%; B has 410 observations and a mean conversion of 37%. The p-value here is 0.0323. In the traditional A/B testing model we'd be done and claiming better than a 10% improvement!

However, when I sample from these 2 beta distributions I see that my credible region is -2% to 34%, meaning this new test could be anywhere from 2% worse to 34% better. No magic value is needed to tell you that you really don't know anything yet.

Another huge help is the use of a prior. Until your data overrides your prior belief you aren't going to see anything. Going with the last example, if I had a good prior that the true conversion rate on that page was actually 33%, I wouldn't even have gotten a p-value of less than 0.05. On the other hand, if I had a strong prior that the conversion rate was 50%, that would imply that both A and B were getting strangely unlucky results, which would actually boost the probability that B was in fact an improvement.

On the philosophical side, Bayesian statistics are simply trying to quantify what you know, not give you 'yes'/'no' answers. Maybe the gamble of -2 to 34 is good for you, or maybe you really want to know tighter bounds on your improvement and aren't comfortable with any possibility of decline. Bayesian statistics gives you a direct way to trade off certainty with time.
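For the curious, the sampling step described above is only a few lines (my own sketch; flat Beta(1, 1) priors and conversion counts approximated from the example's rates):

```python
import random

random.seed(3)

conv_a, n_a = 158, 480   # ~33% conversion on A
conv_b, n_b = 152, 410   # ~37% conversion on B

lifts = []
for _ in range(50000):
    rate_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
    rate_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
    lifts.append(rate_b / rate_a - 1)    # relative improvement of B

lifts.sort()
low, high = lifts[int(0.025 * len(lifts))], lifts[int(0.975 * len(lifts))]
# The 95% credible interval runs from slightly negative to strongly
# positive: B might be a little worse or a lot better.
```

The exact endpoints depend on the prior and interval convention, but the qualitative conclusion matches the example: the interval straddles zero despite the "significant" p-value.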

Full disclosure: I work for Qubit who published this white paper.

Just wanted to add that if you have less than a million users you can A/B test for upper funnel goals, effectively measuring if changes improve engagement. Obviously then you have the problem of working out if the engagement translates into more sales but perhaps you're willing to wait longer to find out if a test that improves engagement leads to more revenue in the long run.

I do this professionally as my sole job. This is one of the very few papers I've read that seem completely legit to me. I especially love their point on necessary sample sizes to get to a 90% power.

How do you calculate the correct sample size for a test, to achieve the correct "power"?

For binomial scenarios like a stock A/B test, most statistical environments will have some sort of built-in power functions. For example, R does; an example: http://www.gwern.net/AB%20testing#power-analysis

For simple tests you can reverse the mathematics to get good estimates of how many observations are needed given a goal for your desired power and tolerance for false positives. Asking for greater power makes your test more sensitive ("buys a bigger telescope") at the cost of increased sample size. Asking for fewer false positives ("cleaning the lenses") costs similarly.

For more sophisticated tests, ones less likely to be seen in an A/B scenario, you might not be able to reverse the mathematics and get a direct answer, so often people will run simulation studies to guess at the needed sample size.

Here are a few resources: http://statpages.org/#Power

I use simulations so I can avoid math and analyze the power of any distribution I can model in a consistent manner.

I wrote something about it here: http://www.databozo.com/2013/10/12/Finding_a_sample_size_for...

EDIT: typo
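A minimal version of the simulation approach (my own sketch, not taken from the linked post; the 30% baseline rate is artificially high just to keep the simulation fast):

```python
import random
from statistics import NormalDist

random.seed(11)

def detects_lift(n, p_a, p_b, alpha=0.05):
    """One simulated experiment: does a two-proportion z-test fire?"""
    c_a = sum(random.random() < p_a for _ in range(n))
    c_b = sum(random.random() < p_b for _ in range(n))
    pooled = (c_a + c_b) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    z = abs(c_a - c_b) / n / se
    return 2 * (1 - NormalDist().cdf(z)) < alpha

# Estimated power for a true 10% relative lift (0.30 -> 0.33) at
# n = 4,000 visitors per arm: the fraction of simulated experiments
# that reach significance.
sims = 300
power = sum(detects_lift(4000, 0.30, 0.33) for _ in range(sims)) / sims
```

To find a sample size, you run this for a grid of `n` values and pick the smallest one whose estimated power clears your target; the same loop works for any data-generating process you can simulate, which is the appeal over closed-form math.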

Lehr's Equation (Rule of 16) is generally a good estimator. It is explained (but not referenced) in a similar article here: http://www.evanmiller.org/how-not-to-run-an-ab-test.html
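For conversion rates, Lehr's rule reduces to a one-liner (80% power and a two-sided 5% alpha are baked into the constant 16):

```python
def lehr_sample_size(p_base, p_variant):
    """Lehr's rule of 16: n per group for ~80% power at 5% alpha."""
    p_bar = (p_base + p_variant) / 2
    return 16 * p_bar * (1 - p_bar) / (p_base - p_variant) ** 2

lehr_sample_size(0.03, 0.045)   # about 2,600 per arm
```

It lands within a few percent of the exact normal-approximation formula for these defaults, which is why it makes a handy sanity check.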

Why is 90% power the magic number? What's wrong with 89.99%? Or 99.99%?

Nothing's "wrong" with 89.99% - 90% is simply a preferred value. The point of the exercise is not to pick a magic number, but to pick A number and therefore a target sample size before collecting results.

Without something like this, it's very easy to fall into the "we'll collect data until something that looks significant appears" trap.

Wait, nothing is wrong with collecting data until you see something significant. In fact, that's the better way to do things. You just need to make sure your math acknowledges the procedure.

In practice, doing this will lead to what the author talks about under "Stopping tests as soon as you see winning results will create false positives." It doesn't make your data invalid, but it will generally lead to poor methodology if you are not careful.

> You just need to make sure your math acknowledges the procedure.

I'd love to hear some clarification on how "[making] sure your math acknowledges" stopping randomized tests early works, because it flies in the face of how actual statistics works.

> it flies in the face of how actual statistics works.

The terms you're looking for are 'sequential testing', 'optimal stopping', and 'adaptive trials'.

consider fair coin flip sequences. stopping at arbitrary points of your choice can never really affect anything serious. but you need to make sure you're doing math right, meaning if you stop when something gets a lead you don't use incorrect math that says the coin is biased.

data is data. as long as the method being used to collect individual data points is fine, and they are collected independently, then the data you get as a result is gonna be OK, the rest (like arbitrary stop time) doesn't ruin it. you just have to avoid bad math.

what ruins data is stuff like throwing 10% of the heads results in the trash or using other approaches in which data can be selectively discarded or not discarded. so just stopping arbitrarily can be a problem if you might never stop and throw out the results if you don't like them. but if you do something like "stop after 1 million data points max, or when i feel like it earlier" then your data is still OK because it cannot get selectively ignored.

stopping earlier cannot make a fair coin look unfair or anything like that.

this is not some random unknown position that flies in the face of how actual statistics works. something like this is the standard bayesian position, and i think it's true. (i strongly object to bayesian epistemology, but i think bayesian statistics is correct).

not ALL stopping rules are OK but lots are. you don't HAVE to use simple ones like "gather X data points, stop".

see stuff like:



> And then there's the Bayesian reply: "Excuse you? The evidential impact of a fixed experimental method, producing the same data, depends on the researcher's private thoughts? And you have the nerve to accuse us of being 'too subjective'?"

By "acknowledge" you probably mean apply an SPRT (sequential probability ratio test). The Neyman-Pearson lemma is also relevant.

http://www.evanmiller.org/how-not-to-run-an-ab-test.html gives a good explanation about how this can ruin your tests.

He even says this is possible via Bayesian experiment design:

FTA: Bayesian experiment design: With Bayesian experiment design you can stop your experiment at any time and make perfectly valid inferences.

Setting a 90% power target is a way to decide when you have something significant.

Tests are calibrated on their false positive (alpha) and false negative rates (beta). If you have a lot of financial/upside/pain information then you can start to determine the relative pain of each of those kinds of failures and calibrate accordingly. At the end of the day the best choice is some complex function of the cost of false positives, the cost of false negatives, the cost of each new observation (which is probably non-linear), the upside of a discovery, and the prior likelihood of finding a discovery.

Which is to say you can definitely pick better choices for alpha, but it's really hard so everyone just picks whatever their field agrees is "OK". In science it's often 95%.

It's not, you can use whatever power you want.

This article's title echoes a paper which continues to influence the medical research and bioinformatics community, "Why Most Published Research Findings Are False" by JPA Ioannidis.


While the OP's article targets some low-hanging fruit, like halting criteria, multiple hypotheses, etc. which should be familiar to anyone serious about bioinformatics and statistics, Ioannidis takes these things a little farther and comes up with a number of corollaries that apply equally well to A/B testing.

After all, the randomized controlled trials that the FDA uses to approve new drugs are essentially identical to what would be called an A/B test on Hacker News.

I strongly recommend using Evan Miller's free A/B testing tools to avoid those issues!

Use them to really know if conversion rate is significantly different, whether the mean value of two groups is significantly different and how to calculate sample size:


This is awesome, thanks for the link! (And the visualizations help a ton, especially for the t-test... it's been a while since I took any stats courses and the terminology always puts me off a bit but the graphs make sense.)

thanks but what does "expected conversion rate" mean exactly? it isn't defined and I couldn't find that term anywhere else on the site. EDIT: ah, ok, got it. but why is their default expected conversion rate set so high? sheeesh

most people have conversion rates between 1% and 3%

Putting aside bandits and all that, it seems like the first step should be to set up a hierarchical prior which performs shrinkage. Multiple comparisons and stopping issues are largely due to using frequentist tests rather than a simple probabilistic model and inference that conditions on the observed data.

Gelman et al, "Why we (usually) don't have to worry about multiple comparisons" http://arxiv.org/abs/0907.2478
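A toy version of the shrinkage idea (an empirical-Bayes sketch, not Gelman's full hierarchical model; the prior strength here is an assumed knob rather than something estimated from the data):

```python
def shrunk_rates(results, prior_strength=200):
    """results: list of (conversions, visitors) per variant.
    Pool everything to fit a common Beta prior, then shrink each
    variant's raw rate toward the pooled mean."""
    total_conv = sum(c for c, _ in results)
    total_n = sum(n for _, n in results)
    pooled = total_conv / total_n          # grand-mean conversion rate
    alpha = pooled * prior_strength        # Beta prior pseudo-successes
    beta = (1 - pooled) * prior_strength   # Beta prior pseudo-failures
    return [(c + alpha) / (n + alpha + beta) for c, n in results]

# A well-measured variant, a tiny "amazing" one, and a mid-sized one.
rates = shrunk_rates([(300, 10000), (4, 50), (33, 1000)])
# The 8% raw rate of the 50-visitor variant gets pulled most of the way
# back toward the ~3% pooled rate; the big variant barely moves.
```

Variants with little data get pulled hardest toward the pool, which is exactly what blunts the multiple-comparisons problem: a lucky small-sample "winner" can no longer stand far from the crowd.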

  > We know that that, occasionally, a test will generate a
  > false positive due to random chance - we can’t avoid that.
  > By convention we normally fix this probability at 5%. You
  > might have heard this called the significance probability
  > or p-value.

  > If we use a p-value cutoff of 5% we also expect to see 5
  > false positives.
Am I reading this incorrectly, or is the author describing p-values incorrectly?

A p-value is the chance a result at least as strong as the observed result would occur if the null hypothesis is true. You can't "fix" this probability at 5%. You can say "results with a p-value below 5% are good candidates for further testing". The fact that p-values of 0.05 and below are often considered significant in academia tells you nothing about the probability of a false positive occurring in an arbitrary test.

Author of the paper here. You're right, this is incorrect. I corrected this in the final copy but an earlier draft seems to have been put on the website. There are a few other errors too. I am describing the 'significance level' here, not the 'p-value', as you say.

is the corrected final version uploaded at the same URL? I'd like to distribute to some colleagues.

Just to let you know it's been updated.

Yes, there's perhaps a small error, although it might be that he's rounded up in his favour.

In his described scenario there are 90 cases where the null hypothesis is true (he states as a premise: "10 out of our 100 variants will be truly effective").

So strictly, we expect to see 5% of 90 = an average of 4.5 false positives (he says 5 false positives).

[Edited to add: False positive rate is measured as a conditional probability https://en.wikipedia.org/wiki/False_positive#False_positive_...]

I don't follow. Why would we expect 5% of those 90 cases to be false positives, and what relationship does the estimate of 5% have to p-value? I don't understand how p-value could ever be used to predict the number of false positives one would expect to observe in a bundle of arbitrary tests.

A p-value cutoff of 5% says that you have a 5% probability that you're wrong in rejecting the Null Hypothesis.

So if you test 100 times, you'd expect to wrongly reject the Null Hypothesis 5 times.

> A p-value cutoff of 5% says that you have a 5% probability that you're wrong in rejecting the Null Hypothesis.

I don't think this is right. A p-value cutoff of 0.05 doesn't, by itself, indicate anything about the underlying probability of incorrectly rejecting the null hypothesis. It tells you, in a test that meets your cutoff, that if the null hypothesis is true, the chance of seeing a result as strong or stronger than the result observed in the test is 5% or less. But that can't tell you the chance you're wrong in rejecting the null hypothesis.

A 1% chance of seeing results as strong as your results if the null hypothesis is true does not mean that there's a 99% chance of the null hypothesis being false.

Regardless, even putting this disagreement to one side, I still don't see how the original author's point makes sense. He or she seems to be using the cutoff as an indication of the underlying false-positive probability for prospective tests, regardless of whether the results of those tests meet the cutoff.

gabemart, you're right; a_bonobo, ronaldx, you guys are wrong. p-values are commonly misunderstood to mean that the result has a "5% chance of being wrong". That's not what a p-value is. Please go ahead and read the 'misunderstandings' section on p-values in Wikipedia.

sigh, no. When the author says "p-value cutoff", this refers to the significance level. I interpreted this correctly.

The article is spot on. We at http://visualwebsiteoptimizer.com/ know that there are some biases (particularly related to 'multiple comparisons' and 'multiple seeing of data') that lead to results that seem better than they actually are. Though the current results are not wrong. They are directionally correct, and with most A/B tests, even if 95% confidence is really a true confidence of 90% or less, the business will still do better implementing the variation (v/s not doing anything).

Of course, these are very important issues for A/B testing vendors like us to understand and fix, since users mostly rely on our calculations to base their decisions. You will see us working towards taking care of such issues.

I'm afraid that's not quite right. A simple Python simulation will show you that a variant with -5% (i.e. NEGATIVE) uplift will still give a positive result around 10% of the time if you perform early stopping of the test.

No matter which method you adopt, you cannot eliminate false positives entirely. You merely decrease / control the proportion of them.

To remove all doubt, your interpretation of the statistics is incorrect. In particular this sentence is demonstrably false: "They are directionally correct, [...] the business will still do better implementing the variation (v/s not doing anything)."
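That simulation is easy to reproduce (my own sketch; the baseline rate is artificially high to keep it fast, and the exact false-winner rate depends on how often you peek and on the traffic involved):

```python
import random
from statistics import NormalDist

random.seed(5)

def peeking_trial(p_control=0.30, p_variant=0.285,   # a true -5% uplift
                  batch=200, max_peeks=20, alpha=0.05):
    """Peek after every batch and stop the first time the (truly worse)
    variant looks significantly better than control."""
    c_a = c_b = n = 0
    for _ in range(max_peeks):
        c_a += sum(random.random() < p_control for _ in range(batch))
        c_b += sum(random.random() < p_variant for _ in range(batch))
        n += batch
        pooled = (c_a + c_b) / (2 * n)
        se = (2 * pooled * (1 - pooled) / n) ** 0.5
        z = (c_b - c_a) / n / se
        if z > 0 and 2 * (1 - NormalDist().cdf(z)) < alpha:
            return True    # worse variant declared the "winner"
    return False

trials = 400
false_winner_rate = sum(peeking_trial() for _ in range(trials)) / trials
# Well above zero, even though the variant is genuinely 5% worse.
```

A single fixed-horizon test would almost never crown this variant; repeated peeking gives it many chances to get lucky early.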

> They are directionally correct, and with most A/B tests even if 95% confidence is really a true confidence of 90% or less, the business will still do better implementing the variation (v/s not doing anything).

What? That's not right at all! A confidence measure is how much you can trust that there's actually a difference. You can't say it'll improve things if your confidence is lower than your original threshold!

In addition to this, every time you change something you:

1) Might introduce bugs

2) Spend money

3) Spend time you could be spending adding a new feature or getting a new customer

> What? That's not right at all! A confidence measure is how much you can trust that there's actually a difference. You can't say it'll improve things if your confidence is lower than your original threshold!

A 95% confidence doesn't magically translate into a binary decision of winner v/s no decision. A 90% confidence means that the variation is more likely to be better than control, but of course not as likely as if confidence were 95%. The p-value is an arbitrary cut-off (a confidence of 94.5% rather than 95% shouldn't make you throw away your results). Of course, in fields such as clinical trials, you'd want to be very sure of your results and might not want to take chances, but on the web, when you're running many tests, you are usually OK with something that is probable to work better than the existing version.

Of course, if it is a high stakes A/B test on the web, you'd be as careful as a clinical trial design. We're working towards making all those techniques available within the tool itself.

Good article in general. I have a small question:

"Let’s imagine we perform 100 tests on a website and, by running each test for 2 months, we have a large enough sample to achieve 80% power. 10 out of our 100 variants will be truly effective and we expect to detect 80%, or 8, of these true effects. If we use a p-value cutoff of 5% we also expect to see 5 false positives. So, on average, we will see 8+5 = 13 winning results from 100 A/B tests."

If we expect 10 truly effective tests and 5 false positives, we'd have 15 tests that rejected the null hypothesis of h_0=h_test. Taking power into account, shouldn't we see 15 * 0.8 = 12 winning results? I.e., wouldn't one of the false positives also be lost to not-enough-power?

Full disclosure: I work for Qubit who published this white paper.

Maybe the confusion here is in tests which have a "true" effect and an "observed" effect. If an experiment has a true effect, then you have some chance to observe it, which is the power.

But false positives have by definition already been observed as winners (that's what false positives are), so there's no need to apply the factor of 0.8 to them.

The "regression to the mean" and "novelty" effects are getting at two different things (both true, both important).

1. Underpowered tests are likely to exaggerate differences, since E(abs(truth - result)) increases as the sample size shrinks.

2. The much bigger problem I've seen a lot: when users see a new layout they aren't accustomed to they often respond better, but when they get used to it, they can begin responding worse than with the old design. Two ways to deal with this are long term testing (let people get used to it) and testing on new users. Or, embrace the novelty effect and just keep changing shit up to keep users guessing - this seems to be FB's solution.
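Point 1 is easy to demonstrate by simulation (my own sketch with made-up numbers): run an underpowered test with a small true lift, keep only the "significant" winners, and see how exaggerated the surviving estimates are.

```python
import random
from statistics import NormalDist

random.seed(13)

TRUE_LIFT = 0.015     # true rates: 0.300 vs 0.315, a 5% relative lift
N = 500               # per arm: badly underpowered for this effect

winning_lifts = []
for _ in range(2000):
    c_a = sum(random.random() < 0.300 for _ in range(N))
    c_b = sum(random.random() < 0.315 for _ in range(N))
    pooled = (c_a + c_b) / (2 * N)
    se = (2 * pooled * (1 - pooled) / N) ** 0.5
    z = (c_b - c_a) / N / se
    if z > 0 and 2 * (1 - NormalDist().cdf(z)) < 0.05:
        winning_lifts.append((c_b - c_a) / N)

# Any "winner" had to clear ~1.96 standard errors (~0.057 here), so the
# surviving estimates are several times the true 0.015 lift.
exaggeration = sum(winning_lifts) / len(winning_lifts) / TRUE_LIFT
```

Conditioning on significance truncates away all the modest (and therefore honest) observed effects, which is why underpowered "winners" look so impressive and then regress.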

Great read.

What bothers me about A/B tests is when people say, e.g., "there was a 7% improvement" without telling us the sample size or error margin. I'd rather hear: on a sample size of 1,000 unique visits, the improvement rate was 7% +/- 4%.

I really liked this; it's condescending, but in a good natured sort of way. It's as if the author was trying to explain really basic statistics to a marketer, then realized that the marketer had NO idea what he was talking about.

So you get statements like "This is a well-known phenomenon, called ‘regression to the mean’ by statisticians. Again, this is common knowledge among statisticians but does not seem to be more widely known."

I thought that was hilarious.

Martin gave this paper as a talk at our PyData London conference this weekend (thanks Martin!), videos will be linked once we have them. He shares hard-won lessons and good advice. Here's my write-up: http://ianozsvald.com/2014/02/24/pydatalondon-2014/

Would be interested to see patio11's feedback on this one.

Correct on the math, to the limit of my understanding of it and quick glance.

I am agnostic about whether most A/B testing practitioners administer their tests correctly -- of the universe of companies I've seen, far and away the most common error regarding A/B testing is "We don't A/B test.", which remains an error even after you read this article.

The novelty effect they talk about, which the article says is probably simple reversion to the mean, is -- in my opinion -- likely a true observation of the state of the world. You can watch your conversion-rate-over-time for many offers, many designs, many products, etc, and they often start out quite high and taper off, both in circumstances where there is obvious alternate causality and in circumstances where there isn't. By comparison, I have not often participated in tests where conversion rates started out abnormally low and reverted to the mean, which we'd expect exactly as often as "started out high" if that was indeed what we were seeing.

I believe so strongly in the novelty effect that I have written proposals to profitably exploit it by scalably manufacturing novelty. Sadly, none of them are public. It's on my to-do list for one of these months but a lot of things are on my to-do list for one of these months.

If you run many tests, which as time approaches infinity you darn better, your odds of seeing a false positive approach one. Contra the article, you gladly accept this as a cost of doing business, because you know to a statistical certainty that you've seen many, many more true positives.

That about sums it up. If you have any particular questions, happy to answer them. My takeaway is "Good article. Please don't use it to justify a decision to not test."

Related... someone should write a good article about estimating customer acquisition costs (CAC, or ROI if you prefer) based on conversion rates of ads.

It drives me batty when people tell me their "average" conversion rate is 1% after running a $25 ad campaign with so few clicks. It seems like too many folks are just oblivious to sample size, confidence interval, and power calculations -- something that could be solved with a quick Wikipedia search [1].

[1] https://en.wikipedia.org/wiki/Sample_size_determination

Regarding the final bullet point of doing a second validation, the sample size should be bigger, right? Because of the tendency for winners to coincide with +ve random effects, you will choose a larger experiment size and expect to see a lesser result.

Visibility on this is set to "Private". Is it really supposed to be linked publicly on HN? I was about to tweet a link to it and then I felt dirty, like maybe the author wanted to send the link to just a select group.

Coming from a poker background, where sample size trumps everything, I've LOL'ed at every person that has ever whipped out an A/B test on me.

This doesn't follow. What if their sample size was 100,000 conversions?

Did you even read the article? The third point is "regression to the mean."

compare and contrast this whitepaper with arguably one of the most common optimization apps out there:


In my experience it can't be overstated how important it is to wait until you have a large sample size to decide whether a variation is the winner. Nearly all of the A/B tests I run start out looking like a variation is the clear, landslide winner (sometimes showing 100%+ improvement over the original) only to eventually end up regressing toward the mean. I can't get a clear idea of the winner of a test until I've shown the variation(s) to tens of thousands of visitors and received a few thousand conversions.

I've also learned that it's important to only perform tests on new visitors when possible. That means tests need to run longer to get the appropriate sample size. If you're testing over a few hundred conversions and performing tests on new and returning visitors then you're probably getting skewed results. Again, that's just in my experience so far. YMMV.

One thing to consider with a test is that the variations may be too subtle to have a significant, positive impact on conversion.
