A much better approach is to install significant instrumentation and actually talk to users about what's wrong with your sign up form.
That, or actually build a product that users want instead of chasing after pointless metrics. I mean, really, you think changing the color of text or a call-out is going to make up for huge deficiencies in your product or make people buy it? The entire premise seems illogical and just doesn't work. The only time I've seen a/b tests truly help was when it accidentally fixed some cross browser issue or moved a button within reach of a user.
Most of the A/B website optimization industry is an elaborate scam, pushed on people who don't know any better and are looking for a magic bullet.
I've personally run tests and seen tests achieve statistically significant (and very large) results that weren't merely fixing bugs or moving buttons within reach of a user. It could be things like adding a modal, changing copy (yes, literally just adding or changing words on the page), removing content, changing an image, or reordering funnel steps.
Think about that: by merely changing some sentences, your website could be 10% more successful next month. But you won't know whether it's working without a/b testing.
You don't just create tests randomly, but based on an understanding of the product, goals, signup funnel, etc. That may involve talking to users, but ultimately you verify the change is for the better through a/b testing.
I would agree that there are a number of people running a/b tests that don't understand the math or know what they are doing, but that isn't an indictment of a/b testing.
You should read the article. It explained very well why a) statistical significance is meaningless when split testing websites (you are guaranteed to achieve significance if you run the test long enough; the math behind significance testing doesn't work the way Optimizely et al. assume) and b) most tools on the market are inherently unable to prove effect size by the way tests are designed.
Just do this short experiment yourself: Do an a/b test with identical versions of your site. Spoiler: With 100% certainty, you'll see both significant results and a sizable effect size difference.
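The "identical versions" experiment is easy to simulate. Here's a rough Python sketch (the conversion rate, batch size, and peeking schedule are all made up) of an A/A test where you check for p < 0.05 after every batch of visitors. Not literally 100% of runs fire, but far more than the nominal 5% do:

```python
import math
import random

def z_test_p(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test p-value (normal approximation)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = abs(conv_a / n_a - conv_b / n_b) / se
    return math.erfc(z / math.sqrt(2))  # two-sided normal tail

def aa_test_with_peeking(rate=0.05, batches=200, batch_size=100, seed=0):
    """Run an A/A test (both arms identical) and 'peek' after every batch.
    Returns True if any peek ever showed p < 0.05."""
    rng = random.Random(seed)
    ca = na = cb = nb = 0
    for _ in range(batches):
        ca += sum(rng.random() < rate for _ in range(batch_size))
        cb += sum(rng.random() < rate for _ in range(batch_size))
        na += batch_size
        nb += batch_size
        if z_test_p(ca, na, cb, nb) < 0.05:
            return True
    return False

false_alarms = sum(aa_test_with_peeking(seed=s) for s in range(100))
print(f"{false_alarms}/100 A/A tests hit 'significance' at some peek")
```

With a single pre-planned look the false positive rate would be 5%; repeated peeking inflates it severalfold, which is the whole point of the A/A exercise.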
> statistical significance is meaningless when split testing websites (you are guaranteed to achieve significance if you run the test long enough)
> Do an a/b test with identical versions of your site. Spoiler: With 100% certainty, you'll see both significant results and a sizable effect size difference.
> Wait, what? Not if you use the right test to evaluate your change.
Sure, but how often do people actually choose the right combination of test, testing tool, and estimated effect size for their significance to have actual meaning?
The problem with experiments like your typical website split test is the potentially infinite population. With a big enough N, the probability of reaching significance approaches 1.
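A rough sketch of the "big enough N" point, under a normal approximation: even a 0.1% absolute difference in conversion rate (a made-up example) becomes near-certain to reach p < 0.05 once N is large enough:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def power_two_prop(p_a, p_b, n_per_arm, z_alpha=1.96):
    """Approximate power of a two-sided two-proportion z-test at 5% alpha."""
    se = math.sqrt(p_a * (1 - p_a) / n_per_arm + p_b * (1 - p_b) / n_per_arm)
    z_effect = abs(p_b - p_a) / se
    return norm_cdf(z_effect - z_alpha) + norm_cdf(-z_effect - z_alpha)

# A tiny 5.0% -> 5.1% shift: significance is almost guaranteed at huge N.
for n in (10_000, 100_000, 1_000_000, 10_000_000):
    print(n, round(power_two_prop(0.050, 0.051, n), 3))
```

So significance on its own tells you nothing about whether the effect is big enough to care about; the effect size and its uncertainty have to be reported alongside it.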
> The whole point of significance testing is that, done properly, it tells you how likely you would be to get these results by chance if the two sides of the test were actually identical.
Yes, I know. But the magic lies in that "done properly". Have you ever approached a tool in the field that would fail my proposed test? I haven't.
> With a big enough N, the probability of reaching significance approaches 1.
On the other hand, in an A vs B experiment where B is very slightly different from A you're right that with a large enough population you become very likely to reach a p < 0.05 significance level, just with an extremely tiny effect size. The experiment has told you something about the difference between A and B, and you now have high certainty, but it's just such a small difference that you probably don't care.
> Have you ever approached a tool in the field that would fail my proposed test?
(If I didn't trust a tool to pass this test I'd just use it to get the counts for both sides of the experiment and do the stats on my own.)
A common place where people get stuck when A/B testing is at a local maximum. For example, instead of tweaking button colors someone should instead redesign the whole form. Talking to users can help with this, but users are notoriously bad at knowing what they really want.
Anyway, on to your comment:
> A much better approach is to install significant instrumentation and actually talk to users about what's wrong with your sign up form.
I don't know what this means. Can you elaborate? What exactly is "significant instrumentation"?
> That, or actually build a product that users want instead of chasing after pointless metrics.
Pointless metrics? I'm not sure what metrics you measure but visitor engagement and visitor to trial conversion rates are the lifeblood of a software sales and marketing funnel.
When somebody in marketing considers A/B testing as a means to increasing trial and paying customer conversions, they're operating on the basis that their product already provides value. The A/B tests are a means of establishing the best method for communicating that value. Better communication = more trials. More trials = a larger pool of potential customers to convert to paying customers. It's pretty basic.
> Most of the A/B website optimization industry is an elaborate scam
What?? Had you said a number of people in marketing could be in danger of blindly relying on their tools (e.g. Optimizely) and not applying common sense or checking the maths I would have wholeheartedly agreed. In my personal experience with the various websites I market, and given the hundreds of split testing examples I've read from people I do not feel are in on the "scam", I've got to strongly disagree with this point.
There is nothing inherently wrong with A/B testing, and it's basically the only way to verify that anything you do has a positive impact. What's created a hype bubble is articles and platforms claiming that minor UI changes, the colour of buttons, etc. make a real difference to conversion rates. Good tests, based on well-thought-out hypotheses relevant to your actual customers, can and sometimes do help websites find measurable gains. But it is not a silver bullet: no matter how careful you are there is always statistical uncertainty, and it won't always help.
In my experience this only works for obvious, glaring flaws, such as "I hate that the phone number field enforces a specific format" or "Why the hell are your password requirements so stringent?"
But a lot of UX is about what happens on the unconscious level. A user will never tell you, "Your bright red background feels too aggressive/hostile" because they simply don't recognize it at the conscious level to be able to verbalize it. The signup or checkout form might make them feel uneasy, but they won't know why.
I find that the best way to optimize websites is to do A/B testing first, and then talk to your users about the subset of the results (which might be the majority) that you find confusing or surprising. Basically, never rely on just one method, because then you will be hamstrung by the shortcomings of that method.
The overall utility of A/B tests is a complex topic and largely depends on the skill of the experimenter. Well-designed experiments are extraordinarily useful. A/B tests are one tool for product development, but they can't replace it.
Many, many people have successfully used A/B Testing. I've personally used it to great effect several times. I certainly don't make decisions purely based on the statistical results, but I find it to be an extremely useful input to the decision making process. All models are flawed; some are useful.
Even talking to users would not help much if one continues to build a faster horse.
Changing the color of text is not much of an improvement, unless the text was unreadable or the same color as the background.
No amount of data-driven analysis can replace what the actual product user says.
But but but telemetry is evil!
Telemetry in my OS* that has access to everything? No.
* That I cannot turn off.
I like Windows and really like VS, I was often literally the only (non-VM) Windows user in a sea of OS X at several offices but Windows 10, the flip-flopping UI, the ads, and the telemetry pushed me to Linux and building my own desktops.
We were asked about this article before, on our community forums, and one of our statisticians, David, wrote a detailed reply to this article's concerns about one- vs two-tailed testing, which might be of interest [3rd from the top]:
Additionally, since then, as other commenters have mentioned, we've completely overhauled how we do our A/B-testing calculations, which, theoretically and empirically, now have an accurate false-positive rate even when monitored continuously. Details:
In my view, the issue is not one-tail vs two-tail tests, or sequential vs one-look tests at all. The issue is a failure to quantify uncertainty.
Optimizely (last time I looked), our old reports, and most other tools all give you improvement as a single number. Unfortunately that's BS. It's simply a lie to say "Variation is 18% better than Control" unless you had Facebook levels of traffic. An honest statement will quantify the uncertainty: "Variation is between -4.5% and +36.4% better than Control".
When phrased this way, it's hardly surprising that deploying this variation failed to achieve an 18% lift - 18% is just one possible value in a wide range of possible values.
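As a sketch of where an interval like that comes from (this is not VWO's actual implementation, and the counts below are hypothetical, chosen so the point estimate is 18%), you can sample each arm's conversion rate from a Beta posterior and read off percentiles of the relative lift:

```python
import random

def lift_interval(conv_a, n_a, conv_b, n_b, draws=20_000, seed=1):
    """95% Bayesian credible interval for the relative lift of B over A,
    using independent uniform Beta(1, 1) priors on each conversion rate."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(draws):
        ra = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rb = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        lifts.append(rb / ra - 1.0)
    lifts.sort()
    return lifts[int(0.025 * draws)], lifts[int(0.975 * draws)]

# Hypothetical counts: 200/4000 vs 236/4000 gives an "18% better" headline,
# but the honest statement is a wide range around it.
lo, hi = lift_interval(conv_a=200, n_a=4000, conv_b=236, n_b=4000)
print(f"Variation is between {lo:+.1%} and {hi:+.1%} vs Control")
```

The uniform prior is just the simplest choice; the point is that the single "18%" number is one draw from a very wide posterior.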
The big problem with this is that customers (particularly agencies who are selling A/B test results to clients) hate it. If we were VC funded, we might even have someone pushing us to tell customers the lie they want rather than the truth they need.
Note that to provide uncertainty bounds like this, one needs to use a Bayesian method (only us, AB Tasty and Qubit do this, unless I forgot about someone).
(Frequentist methods can provide confidence intervals, but these are NOT the same thing. Unfortunately p-values and confidence intervals are completely unsuitable for reporting to non-statisticians; they are completely misinterpreted by almost 100% of laypeople. http://myweb.brooklyn.liu.edu/cortiz/PDF%20Files/Misinterpre... http://www.ejwagenmakers.com/inpress/HoekstraEtAlPBR.pdf )
I got lots of pushback because it's harder to remember two numbers instead of one, and it makes reporting confusing because executives are also used to just the one. Throw in the business guys who did statistical significance by (literal quote) "experience and gut feel" instead of reliable, rigorous quantification of uncertainty, i.e. statistics, and I really couldn't do anything.
I feel like the discussion on this thread is missing the underlying "source code"
Confidence intervals are confusing and should rarely be used to communicate statistics to laypeople. (See the links in my above post.) Most of the time, you can tell people the definition of a confidence interval, and they will misinterpret this and think it's a credible interval.
This is why we went Bayesian at VWO - we felt it would be easier to change the statistics to match people's thinking than to change people's thinking about statistics.
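"Chance to beat control" style numbers fall out of the same posteriors. A minimal sketch, again with hypothetical counts and uniform priors (not any vendor's actual code):

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=2):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        ra = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rb = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rb > ra
    return wins / draws

# Hypothetical data: a statement like "96% chance B beats A" reads naturally
# to laypeople, unlike a p-value or a confidence interval.
print(prob_b_beats_a(conv_a=200, n_a=4000, conv_b=236, n_b=4000))
```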
The solution mentioned of running a two tailed test would not have solved the problem of a false result the author demonstrated through conducting an A/A test.
According to the image in the article:
The A/A test had:
A1: Population: 3920
A2: Population: 3999
2-tailed test significance: 99.92%
So, maybe a larger sample size would have seen a reversion to the mean, but given the size and high significance that would be unlikely (interesting exercise to try different assumptions to calculate how unlikely, with the most overly generous obviously just being the stated significance).
Yes, the test was only conducted over one day, but if it was the exact same thing being served for both, that shouldn't matter.
If there was a reversion to the mean due to an early spike, we would expect to see the % difference between the two cells narrow as the test kept running. You can see in the chart that the % difference (relative gap between the lines) stays about the same after 8pm on the 9th.
So if it's not the one-tailed test at fault, and it's not the short duration of the test at fault, what is?
Setup problems are incredibly easy to make when implementing a/b testing tools on your site. In other tools I've seen things like automated traffic from Akamai only going to the default control, or subsets of traffic such as returning visitors being excluded from some cells but not others.
Based on those results, I'd be suspicious of something in the tool setup being amiss.
It always pains me a little when people doing research describe statistical power as a type of curse. Overpowered? Should we reduce it? The risk isn't having too much power; the risk is that someone will incorrectly interpret their Null Hypothesis Significance Test (NHST). They need to shift their focus to measuring something (and quantifying the uncertainty of their measurements), rather than thinking of "how likely was this result given a null hypothesis", whether that hypothesis is:
something is not greater than 0 (one-tail), or
something is not 0 (two-tail).
> You’ll often see statistical power conveyed as P90 or 90%. In other words, if there’s a 90% chance A is better than B, there’s a 10% chance B is better than A and you’ll actually get worse results.
This isn't necessarily true. A could be the same as B. Also, these tests are being done from the frequentist perspective, so saying "there's X chance B is better than A" is inappropriate, unless you're talking about the conclusions of your significance test (e.g. 90% chance you correctly detect a difference between them--a difference you assume is fixed to some true underlying value). Overall, being aware that a one-tail test is taking the position that nothing can happen in the other direction is useful, but a good next step is understanding what NHST can and cannot say.
This isn't even a frequentist vs Bayesian problem, since you could create situations where a person felt a study was overpowered in either framework.
If you use one of these tools, it's completely safe to run tests until the test says you've won.
There are still caveats to account for real-world effects - e.g., only stop the test after an integer number of weeks - but statistically this is a solved problem.
The blog post under discussion was written before Optimizely's Stats Engine and VWO's SmartStats.
Standard statistical tests used in a/b testing are based on one check. If someone checks repeatedly on a test until they get a 'significant' result, the chance of getting a false positive is many times the stated significance level.
Best practice: set a pre-defined end, and one or two defined early check-in points where you only make an early call if the result is overwhelmingly significant or if the business has fallen off a cliff.
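That practice can be checked by simulation. In this sketch (the thresholds and sample sizes are my own choices, roughly in the spirit of a Haybittle-Peto boundary) an A/A test gets one strict interim look at ~p < 0.001 plus a normal final look at ~p < 0.05, and the overall false positive rate stays close to the nominal 5%:

```python
import math
import random

def z_stat(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic (pooled standard error)."""
    p = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se

def run_test(rate, n_final, rng, z_interim=3.29, z_final=1.96):
    """One A/A run: a strict check halfway (~p<0.001, two-sided) and a
    normal check at the end (~p<0.05). Returns True if 'significant'."""
    n_half = n_final // 2
    a = [rng.random() < rate for _ in range(n_final)]
    b = [rng.random() < rate for _ in range(n_final)]
    if abs(z_stat(sum(a[:n_half]), n_half, sum(b[:n_half]), n_half)) > z_interim:
        return True
    return abs(z_stat(sum(a), n_final, sum(b), n_final)) > z_final

rng = random.Random(3)
hits = sum(run_test(0.05, 2000, rng) for _ in range(1000))
print(f"overall false positive rate ~ {hits / 1000:.3f}")
```

Contrast with the "peek after every batch" behaviour: the strict interim threshold is what keeps the overall error rate honest.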
We try to run as many experiments as our traffic can handle, but we always estimate the sample size upfront when planning the test.
> Few websites actually get enough traffic for their audiences to even out into a nice pretty bell curve. If you get less than a million visitors a month your audience won’t be identically distributed and, even then, it can be unlikely.
What is the author trying to say here? Has he thought hard about what it means for "an audience" to be identically distributed?
> Likewise, the things that matter to you on your website, like order values, are not normally distributed
Why do they need to be?
> Statistical power is simply the likelihood that the difference you’ve detected during your experiment actually reflects a difference in the real world.
Simply googling the term would reveal this is incorrect.
So true and sad. In all the so called data-driven groups I have worked for, the tyranny of data makes metrics and numbers the justification for or counter to anything, however they have been put together.
> The sad truth is that most people aren’t being rigorous about their A/B testing and, in fact, one could argue that they’re not A/B testing at all, they’re just confirming their own hypotheses.
The sad truth is that most people aren’t being rigorous about anything.
There are _many_ offenders. I've yet to see a commercial tool that gets it right.
Tragically, the revamp by Optimizely neglects the straightforward Bayesian solution and uses a more fragile and complex sequential technique.
Bandit algorithms do have some important use cases (optimizing yield from a short lived advertisement, e.g. "Valentines Day Sale"), but they are not suitable for use as an A/B test replacement.
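For what it's worth, a minimal Thompson-sampling bandit (conversion rates made up) shows the behaviour that makes bandits good for short-lived optimization but awkward as an A/B replacement: traffic shifts toward the apparent winner long before you have tight estimates for the losing arms:

```python
import random

def thompson_pick(successes, failures, rng):
    """Thompson sampling: draw from each arm's Beta posterior, play the best."""
    samples = [rng.betavariate(1 + s, 1 + f)
               for s, f in zip(successes, failures)]
    return samples.index(max(samples))

def run_bandit(true_rates, pulls=5000, seed=4):
    """Simulate Bernoulli arms; returns (successes, failures) per arm."""
    rng = random.Random(seed)
    s = [0] * len(true_rates)
    f = [0] * len(true_rates)
    for _ in range(pulls):
        arm = thompson_pick(s, f, rng)
        if rng.random() < true_rates[arm]:
            s[arm] += 1
        else:
            f[arm] += 1
    return s, f

s, f = run_bandit([0.03, 0.05, 0.08])
plays = [si + fi for si, fi in zip(s, f)]
print("plays per arm:", plays)
```

That concentration is exactly what you want for a Valentine's Day banner and exactly what starves a proper experiment of data on the other variations.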
Also, I'd steer away from Dynamic Yield - I've found their descriptions of their statistics to take dangerous (i.e. totally wrong) shortcuts. For example, counting sessions instead of visitors as a way to avoid the delayed reaction problem and increase sample size (as well as completely breaking the IID assumption).
To be fair though, real-world issues like nonstationarity and delayed feedback are also concerns for A/B tests (which you also bring up in your great post), and you can tweak Bayesian bandits to handle these cases decently.
How does counting sessions instead of visitors avoid delayed feedback? I read your post but don't remember anything about that. Is it just that they say that after a session is completed (which is somewhat nebulous to measure in many cases), you have all the data you need from the visit?
(I've also dealt with delayed reactions, but I've never published it, and probably won't publish until I launch it.)
Dynamic Yield has the delayed feedback problem because users might see a variation in session 1 but convert in session 2 (days later). They "solve" this by doing session-level tracking instead of visitor-level tracking - the delayed feedback is now only 20 minutes (same session) instead of days.
The problem is that session A and session B are now correlated since they are the same visitor. IID is now broken.
Cf. the technical paper: http://pages.optimizely.com/rs/optimizely/images/stats_engin...
If you ever come up with a statistical method for finding the global maximum among an infinite number of unspecified alternatives, let me know.
> Most A/B testing tools recommend terminating tests as soon as they show significance, even though that significance may very well be due to short-term bias. A little green indicator will pop up, as it does in Optimizely, and the marketer will turn the test off.
People pay brisk money for this?
This seems incorrect to me. Isn't statistical power the likelihood that the null hypothesis would generate an outcome at least as extreme as what you observed?
I'm guessing the issue has a lot more to do with peeking at the outcome and not correcting for it (and similarly running many tests)
Also the next sentence is wrong:
> You’ll often see statistical power conveyed as P90 or 90%. In other words, if there’s a 90% chance A is better than B, there’s a 10% chance B is better than A and you’ll actually get worse results.
Having a test w/ 90% power means that if A is truly better than B (for one sided) or A is truly different than B (two sided), then you'll detect it 90% of the time you run the test (on independent data).
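That definition is easy to check by simulation. In the sketch below the rates and sample size are invented; n is chosen so a two-proportion z-test has roughly 90% power to detect a true 5.0% -> 6.5% shift, and sure enough about 90% of independent runs reach significance:

```python
import math
import random

def significant(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Two-sided two-proportion z-test at the 5% level."""
    p = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return se > 0 and abs(conv_b / n_b - conv_a / n_a) / se > z_crit

def detection_rate(p_a, p_b, n_per_arm, trials=500, seed=5):
    """Fraction of independent experiments reaching p < 0.05 when the
    true rates really are p_a and p_b."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        ca = sum(rng.random() < p_a for _ in range(n_per_arm))
        cb = sum(rng.random() < p_b for _ in range(n_per_arm))
        hits += significant(ca, n_per_arm, cb, n_per_arm)
    return hits / trials

# n ~ 5100 per arm is sized for ~90% power on a 5.0% -> 6.5% difference.
print(detection_rate(0.050, 0.065, 5100))
```

Run it with identical rates and the "detection" rate drops to the 5% false positive level, which is the distinction the quoted sentence blurs.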
You would think, given their team of "analysts" and "statisticians", that they might have known these basic pieces of statistics.
Also, in general, the more drastic the changes are, the more of an effect you could have (up to some percentage). E.g. a small change would be changing a message or color - don't expect conversion to change by much. A large change would be going from a Flash site to an HTML one with a full redesign that loads twice as fast...
I also think a lot of the "change the button" tests or "increased email sign up by x times" results are dubious. Unless the conversion is someone giving you money, you still have steps to go before you see real business improvement. There are lots of ways I can manipulate my traffic to make certain funnel steps look better and more optimized, but the only thing I really care about is what's coming out of the funnel. So all those extra email signups or button pushes mean nothing if the group who perform those actions on your b version still aren't interested in an actual purchase.
My favorite example is still the quite popular "Page weight matters" post. I wonder how close they were to abandoning a 90% reduction in size. I wonder how many improvements the world at large has thrown away due to faulty analysis.
The real problem (as you allude to in the article) is that the demand for accurate tools is not really there. Vendors don't build in accurate stats because only a tiny portion of their client base understand/demands them.
1) Run the experiment in whole business cycles (for us, 1 week = one cycle), based on a sample size you've calculated upfront (I use http://www.evanmiller.org/ab-testing/sample-size.html). Accept that some changes are just not testable in any sensible amount of time (I wonder what the effect of changing a font will have on e-commerce conversion rate).
2) Use more than one set of metrics for analysis to discover unexpected effects. We use the Optimizely results screen for general steer, but do final analysis in either Google Analytics or our own databases. Sometimes tests can positively affect the primary metric but negatively affect another.
3) Get qualitative feedback either before or during the test. We use a combination of user testing (remote or moderated) and session recording (we use Hotjar, and send tags so we can view sessions in that experiment).
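The upfront sample-size step in 1) can be sketched with the usual two-proportion approximation (a simplification in the spirit of calculators like Evan Miller's, not a reimplementation of one):

```python
import math

def sample_size_per_arm(base_rate, mde_relative, power=0.8):
    """Visitors needed per variation to detect a relative lift of
    `mde_relative` over `base_rate` with a two-sided 5% z-test."""
    p1 = base_rate
    p2 = base_rate * (1 + mde_relative)
    z_alpha = 1.959964                          # two-sided 5% level
    z_beta = {0.8: 0.841621, 0.9: 1.281552}[power]
    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(numerator / (p2 - p1) ** 2)

# e.g. baseline 5% conversion, hoping to detect a 10% relative lift:
n = sample_size_per_arm(0.05, 0.10)
print(n, "visitors per variation")
```

The output makes the "some changes are just not testable" point concrete: a subtle change at a modest baseline rate needs tens of thousands of visitors per arm before the test can say anything.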
As a founder, I'm constantly hearing about A/B testing and how great these tools are. I'm not enough of a statistician to know whether everything in this article is true/valid (and would welcome a rebuttal), but the part about regression to the mean really hits home. Encouraging users to cut off testing too early means that you make them feel good ("Look, we had this huge difference!"), when in reality the difference is smaller/negligible.
I'll still do some A/B testing, but given our engineering/time constraints—and my inability to accurately vet the claims/conclusions of the testing software—I won't spend too much time on this.
Learning from the mistakes made by others, and avoiding them, is what I would suggest as the takeaway.
I work for a successful ($50M+ revenue) bootstrapped startup. And one of the reasons for the success is that AB testing became part of the company's culture, as soon as there was enough data coming in for the tests to become useful.
AB testing is so important that we have built our own in-house framework that automatically gives results for our company specific KPIs.
Some quotes from the article supporting the cynical worldview:
"Most A/B testing tools recommend terminating tests as soon as they show significance, even though that significance may very well be due to short-term bias. A little green indicator will pop up, as it does in Optimizely, and the marketer will turn the test off. But most tests should run longer and in many cases it’s likely that the results would be less impressive if they did. Again, this is a great example of the default settings in these platforms being used to increase excitement and keep the users coming back for more."
This stops just short of saying outright that Optimizely is doing this on purpose.
"In most organizations, if someone wants to make a change to the website, they’ll want data to support that change. Instead of going into their experiments being open to the unexpected, open to being wrong, open to being surprised, they’re actively rooting for one of the variations. Illusory results don’t matter as long as they have fodder for the next meeting with their boss. And since most organizations aren’t tracking the results of their winning A/B tests against the bottom line, no one notices."
In other words, everybody is bullshitting everybody, but it doesn't matter as long as everyone plays along and money keeps flowing.
"Over the years, I’ve spoken to a lot of marketers about A/B testing and conversion optimization, and, if one thing has become clear, it’s how unconcerned with statistics most marketers are. Remarkably few marketers understand statistics, sample size, or what it takes to run a valid A/B test."
"Companies that provide conversion testing know this. Many of those vendors are more than happy to provide an interface with a simple mechanic that tells the user if a test has been won or lost, and some numeric value indicating by how much. These aren’t unbiased experiments; they’re a way of providing a fast report with great looking results that are ideal for a PowerPoint presentation. Most conversion testing is a marketing toy, essentially." (emphasis mine)
Thank you for admitting it publicly.
Like whales, whose cancers grow so big that the tumors catch their own cancers and die, it seems the marketing industry, that well-known paragon of honesty and teacher of truth, is actually being held down by its own tool makers applying their honourable strategies within their own industry.
I know it's not a very appropriate thing to do, but I really want to laugh out loud at this. Karma is a bitch. :).
 - http://www.nature.com/news/2007/070730/full/news070730-3.htm...
Most of the ones you hear about. You know, the ones who were SEOs or content writers or programmers before waking up one day and deciding to be marketers.
Marketing degrees have mandatory statistics courses. The good marketing programs take them very seriously. However, a lot of schools focus on the communication side of marketing, which leads a very significant chunk of marketing analyst positions to be filled by people with economics and accounting degrees.
The Internet is actually quite new and has only just begun maturing. A lot of people working in digital marketing do not really understand what marketing is. When you meet somebody on the street and ask them what marketing is, they will describe advertising, or more precisely marketing communications. So when the internet became this giant medium for doing all sorts of commerce, big companies and schools couldn't fill the skill gap fast enough, so the gap was filled by self-learners coming from all sorts of backgrounds. When they wanted to build up their "marketing skills", naturally they defaulted to learning about marketing communication instead of the economics-oriented part of marketing. This is why digital marketers obsess over their site and drool over a/b tests and such. The web site is a communication medium. They've put themselves in a box equating marketing with communications.
Too bad for them, because graduates nowadays are digital natives too, so they have no problem navigating the internet and learning HTML/CSS.
Unfortunately, if it did work, it would probably be through something misleading or scammy. Therefore, you need some kind of automatic legality checking... which would be hard.