
Winning A/B results were not translating into improved user acquisition - pretzel
http://blog.sumall.com/journal/optimizely-got-me-fired.html
======
pmiller2
The red flag here for me was that Optimizely encourages you to stop the test
as soon as it "reaches significance." You shouldn't do that. What you should
do is precalculate a sample size based on the statistical power you need,
which involves determining your tolerance for the probability of making an
error and on the minimum effect size you need to detect. Then, you run the
test to completion and crunch the numbers afterward. This helps prevent the
scenario where your page tests 18% better than itself by minimizing the
probability that your "results" are just a consequence of a streak of positive
results in one branch of the test.
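
A rough sketch of that pre-calculation for a two-proportion test (the baseline
rate, minimum detectable lift, error tolerance, and power below are made-up
illustration values, not anything from the article):

    from scipy.stats import norm

    def required_sample_size(p_baseline, min_relative_lift, alpha=0.05, power=0.8):
        """Approximate per-variation sample size for a two-proportion z-test,
        using the standard normal-approximation formula."""
        p1 = p_baseline
        p2 = p_baseline * (1 + min_relative_lift)  # smallest effect worth detecting
        z_alpha = norm.ppf(1 - alpha / 2)          # from your tolerance for false positives
        z_beta = norm.ppf(power)                   # from the power you want
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return int(variance * (z_alpha + z_beta) ** 2 / (p2 - p1) ** 2) + 1

    # e.g. a 3% baseline conversion rate and a 10% relative lift you care about:
    print(required_sample_size(0.03, 0.10))  # visitors needed per variation; run to completion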

I was also disturbed that the effect size wasn't taken into account in the sample
size selection. You need to know this before you do any type of statistical
test. Otherwise, you are likely to get "positive" results that just don't mean
anything.

OTOH, I wasn't too concerned that the test was a one-tailed test. Honestly, in
a website A/B test, all I really am concerned about is whether my new page is
better than the old page. A one-tailed test tells you that. It might be
interesting to run two-tailed tests just so you can get an idea of what not to
do, but for this use I think a one-tailed test is fine. It's not like you're
testing drugs, where finding any effect, either positive or negative, can be
valuable.
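
For a symmetric statistic, when the observed effect is in the direction you're
testing, the one-tailed p-value is simply half the two-tailed one, which is why
a one-tailed test crosses the "significant" line on weaker evidence. A quick
sketch with made-up counts (not the article's data):

    from scipy.stats import norm

    # Hypothetical conversions / visitors per variation.
    conv_a, n_a = 120, 4000
    conv_b, n_b = 150, 4000

    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                   # pooled rate under the null
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5  # std. error of the difference
    z = (p_b - p_a) / se

    print(norm.sf(z))            # one-tailed: "is B better than A?"
    print(2 * norm.sf(abs(z)))   # two-tailed: "is B different from A?"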

I should also note that I only really know enough about statistics to not
shoot myself in the foot in a big, obvious way. You should get a real stats
person to work on this stuff if your livelihood depends on it.

~~~
dsiroker
Hi pmiller, Dan from Optimizely here. Thanks for your thoughtful response.
This is a really important issue for us, so I wanted to set the record
straight on a couple of points:

#1 - “Optimizely encourages you to stop the test as soon as it reaches
‘statistical significance.’” - This actually isn’t true. We recommend you
calculate your sample size before you start your test using a statistical
significance calculator and wait until you reach that sample size before
stopping your test. We wrote a detailed article about how long to run a test,
here: [https://help.optimizely.com/hc/en-us/articles/200133789-How-...](https://help.optimizely.com/hc/en-us/articles/200133789-How-long-to-run-a-test)

We also have a sample size calculator you can use, here:
[https://www.optimizely.com/resources/sample-size-calculator](https://www.optimizely.com/resources/sample-size-calculator)

#2 - Optimizely uses a one-tailed test, rather than a 2-tailed test. - This is
a point the article makes and it came up in our customer community a few weeks
ago. One of our statisticians wrote a detailed reply, and here’s the TL;DR:

- Optimizely actually uses two 1-tailed tests, not one.

- There is no mathematical difference between a 2-tailed test at 95%
confidence and two 1-tailed tests at 97.5% confidence.

- There is a difference in the way you describe error, and we believe we
define error in a way that is most natural within the context of A/B testing.

- You can achieve the same result as a 2-tailed test at 95% confidence in
Optimizely by requiring the Chance to Beat Baseline to exceed 97.5%.

- We’re working on some exciting enhancements to our methodologies to make
results even easier to interpret and more meaningfully actionable for those
with no formal Statistics background. Stay tuned!

Here’s the full response if you’re interested in reading more:
[http://community.optimizely.com/t5/Strategy-Culture/Let-s-ta...](http://community.optimizely.com/t5/Strategy-Culture/Let-s-talk-about-Single-Tailed-vs-Double-Tailed/m-p/4278#M114)
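
A quick back-of-the-envelope check of that equivalence (illustrative only, not
how the product computes results): the rejection cutoff is identical either way.

    from scipy.stats import norm

    # Two-tailed test at 95% confidence rejects when |z| exceeds this:
    print(norm.ppf(1 - 0.05 / 2))   # ~1.96

    # One-tailed test at 97.5% confidence rejects when z exceeds this:
    print(norm.ppf(1 - 0.025))      # ~1.96, the same cutoff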

Overall I think it’s great that we’re having this conversation in a public
forum because it draws attention to the fact that statistics matter in
interpreting test results accurately. All too often, I see people running A/B
tests without thinking about how to ensure their results are statistically
valid.

Dan

~~~
pmiller2
Thanks for replying. I agree with all the points your statistician covered,
but you should make sure your users know what kind of test you're using. The
only reason I say this is that this article gives me the impression that you
were using a single one-tailed test (which, as I said in my post, is a
perfectly acceptable thing to do in the context of website A/B testing).

But, as far as "Optimizely encourages you to stop the test as soon as it
reaches 'statistical significance,'" I'm not saying your user documentation or
anything encourages people to stop tests early. I'm saying (and this is based
only on the article as I've never used Optimizely) that your platform is
psychologically encouraging users to stop tests early. E.g. from the article:

    
    
> Most A/B testing tools recommend terminating tests as soon as they show
> significance, even though that significance may very well be due to
> short-term bias. A little green indicator will pop up, as it does in
> Optimizely, and the marketer will turn the test off.
>
> <image with a green check mark saying "Variation 1 is beating Variation 2 by 18.1%">
>
> But most tests should run longer and in many cases it’s likely that the
> results would be less impressive if they did. Again, this is a great example
> of the default settings in these platforms being used to increase excitement
> and keep the users coming back for more.

I am aware of literature in experimental design that talks about criteria for
stopping an experiment before its designed conclusion. Such things are useful
in, say, medical research, where if you see a very strong positive or negative
result early on, you want to have that safety valve to either get the
drug/treatment to market more quickly or to avoid hurting people
unnecessarily.
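
To make the early-stopping problem concrete, here's a small simulation (my own
sketch, not based on anything Optimizely actually does): two identical
variations, checked for "significance" after every batch of visitors, with the
test stopped at the first green light. The nominal 5% false positive rate gets
badly inflated.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def aa_test_with_peeking(rate=0.05, batch=500, max_batches=40, alpha=0.05):
        """One simulated A/A test, peeking after every batch; returns True if we
        ever declare a 'winner' (a false positive, since both arms are identical)."""
        conv = np.zeros(2)
        n = np.zeros(2)
        for _ in range(max_batches):
            conv += rng.binomial(batch, rate, size=2)
            n += batch
            pool = conv.sum() / n.sum()
            se = np.sqrt(pool * (1 - pool) * (1 / n[0] + 1 / n[1]))
            z = (conv[1] / n[1] - conv[0] / n[0]) / se
            if 2 * norm.sf(abs(z)) < alpha:  # "reached significance" -- stop and celebrate
                return True
        return False

    runs = 2000
    print(sum(aa_test_with_peeking() for _ in range(runs)) / runs)  # typically well above 0.05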

Unless you've built that analysis into when you display your "success message"
that "Variation 1 is beating Variation 2 by 18.1%," I'd argue that you're
doing users a disservice. When I see that message, I want to celebrate,
declare victory, and stop the test; and that's not what you should encourage
people to do unless it's statistically sound to do so.

The other thing in the article that led me to this position is that you
display "conversion rate over time" as a time series graph. Again, if I see
that and I notice one variation is outperforming the other, what I want to do
is declare victory and stop the test. That might not be
mathematically/statistically warranted.

IMO, as a provider of statistical software, you'd do your users a service by
not displaying anything about a running experiment by default until
it's either finished or you can mathematically say it's safe to stop the
trial. Some people will want their pretty graphs and such, so give them a way
to see them, but make them expend some effort to do so. Same thing with
prematurely ended experiments; don't provide any conclusions based on an
incomplete trial. Give users the ability to download the raw data from a
prematurely ended experiment, but don't make it easy or the default.

------
antr
Note on SumAll

Anyone who uses SumAll should be wary of their service. We tried them out and
then found that they had used our social media accounts to spam our followers
and users with their advertising. We contacted them asking for answers and
never heard back. Our suggestion: avoid SumAll.

~~~
JacobSumAll
Hey Antr, Jacob from SumAll here. Sorry to hear you had a bad experience with
us. The tweets you're talking about that "spam" your accounts were most likely
the performance tweets that you are free to toggle on and off. Here's how you
can do that:
[https://support.sumall.com/customer/portal/articles/1378662-...](https://support.sumall.com/customer/portal/articles/1378662-disable-performance-or-thank-you-tweet)

Best, Jacob

~~~
pluma
As the tweets contain both SumAll-related hashtags and links to SumAll, this
is definitely marketing that should be opt-in, not opt-out. Unless the user of
your service is explicitly made aware of these automated tweets in clear terms
when they sign up, this is a bit shady and dishonest to say the least.

~~~
spacefight
Even if it's in the terms - do it opt-in.

------
josefresco
This article comes off as a bit boastful and somewhat of an advertisement for
the company...

"What threw a wrench into the works was that SumAll isn’t your typical
company. We’re a group of incredibly technical people, with many data analysts
and statisticians on staff. We have to be, as our company specializes in
aggregating and analyzing business data. Flashy, impressive numbers aren’t
enough to convince us that the lifts we were seeing were real unless we
examined them under the cold, hard light of our key business metrics."

I was expecting some admission of how their business is actually
different/unusual, not just "incredibly technical". Secondly, I was expecting
to hear that these "technical" people monkeyed with the A/B testing (or simply
over-thought it) which got them into trouble ... but no, just a statement
about how "flashy" numbers don't appeal to them.

I think the article would be much better without some of that background.

~~~
falsestprophet
They are incredible as in literally not credible.

------
jere
>We decided to test two identical versions of our homepage against each
other... we saw that the new variation, which was identical to the first, saw
an 18.1% improvement. Even more troubling was that there was a “100%”
probability of this result being accurate.

Wow. Cool explanation of one-tailed and two-tailed tests. Somehow I have never
run across that. Here's a link with more detail (I think it's the one intended
in the article, but a different one was used):
[http://www.ats.ucla.edu/stat/mult_pkg/faq/general/tail_tests...](http://www.ats.ucla.edu/stat/mult_pkg/faq/general/tail_tests.htm)

------
raverbashing
Oh great, another misuse of A/B testing

Here's the thing: stop A/Bing every little thing (and/or "just because") and
you'll get more significant results.

Do you think the true success of something is due to A/B testing? A/B testing
is optimizing, not architecting.

~~~
seanflyon
Indeed. A/B testing will get you stuck on local optima.

------
ssharp
It seems like I see these articles pop up on a regular basis over at Inbound
or GrowthHackers.

I think the problem is two-sided: one on the part of the tester and one on the
part of the tools. The tools' "statistically significant" winners MUST be taken
with a grain of salt.

On the user side, you simply cannot trust the tools. To avoid these pitfalls,
I'd recommend a few key things. One, know your conversion rates. If you're new
to a site and don't know patterns, run A/A tests, run small A/B tests, dig
into your analytics. Before you run a serious A/B test, you'd better know
historical conversion rates and recent conversion rates. If you know your
variances, it's even better, but you could probably heuristically understand
your rate fluctuations just by looking at analytics and doing A/A tests. Two,
keep your tests running long after you get a "winning" result. Three, have the
traffic. If you don't have enough traffic, your ability to run A/B tests is
greatly reduced and you become more prone to making mistakes because you're
probably an ambitious person and want to keep making improvements! The nice
thing here is that if you don't have enough traffic to run tests, you're
probably better off doing other stuff anyway.

On the tools side (and I speak from using VWO, not Optimizely, so things could
be different), VWO tags are on all my pages. VWO knows what my goals are.
Even if I'm not running active tests on pages, why can't they collect data
anyway and get a better idea of what my typical conversion rates are? That
way, that data can be included and considered before they tell me I have a
"winner". Maybe this is nitpicky, but I keep seeing people who are actively
involved in A/B testing write articles like this, and I have to think the
tools could do a better job in not steering intermediate-level users down the
wrong path, let alone novice users.

------
pocp2
What he did in that article is more commonly known as an "A/A test."

Optimizely actually has a decent article on it:
[https://help.optimizely.com/hc/en-us/articles/200040355-Run-...](https://help.optimizely.com/hc/en-us/articles/200040355-Run-and-interpret-an-A-A-test)

------
jmount
I just checked in one possible R calculation of two-sided significance under
a binomial model, with the simple null hypothesis that A and B share the same
common rate (and that that rate is exactly what was observed, a simplifying
assumption), here:
[http://winvector.github.io/rateTest/rateTestExample.html](http://winvector.github.io/rateTest/rateTestExample.html)
The long and short of it is that you get slightly different significances
depending on what model you assume, but in all cases it is straightforward to
calculate an exact significance subject to your assumptions. In this case it
says differences this large would only be seen about 1.8% to 2% of the time
(a two-sided test). So the result isn't that likely under the null hypothesis
(and then you make a leap of faith that maybe the rates are different). I've
written a lot about these topics at the Win-Vector blog:
[http://www.win-vector.com/blog/2014/05/a-clear-picture-of-po...](http://www.win-vector.com/blog/2014/05/a-clear-picture-of-power-and-significance-in-ab-tests/)
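
A rough Python sketch in the same spirit (placeholder counts, since the
article's raw numbers aren't reproduced in this thread; see the linked R
notebook for the actual calculation):

    import numpy as np
    from scipy.stats import norm

    # Placeholder conversions / visitors for the two variations.
    conv_a, n_a = 300, 10000
    conv_b, n_b = 355, 10000

    # Simple null: both arms share the observed pooled rate.
    pool = (conv_a + conv_b) / (n_a + n_b)
    diff_obs = abs(conv_b / n_b - conv_a / n_a)

    # Normal-approximation two-sided significance.
    se = (pool * (1 - pool) * (1 / n_a + 1 / n_b)) ** 0.5
    print(2 * norm.sf(diff_obs / se))

    # Monte Carlo under the binomial null, as a check on the approximation.
    rng = np.random.default_rng(1)
    sims = 100_000
    a = rng.binomial(n_a, pool, sims) / n_a
    b = rng.binomial(n_b, pool, sims) / n_b
    print((np.abs(b - a) >= diff_obs).mean())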

They said they ran an A/A test (a very good idea), but the numbers seem
slightly implausible under the assumption that the two variations are
identical (which, again, doesn't immediately imply the two variations are in
fact different).

The important thing to remember is that your exact significances/probabilities are
a function of the unknown true rates, your data, and your modeling
assumptions. The usual advice is to control the undesirable dependence on
modeling assumptions by using only "brand name tests." I actually prefer using
ad-hoc tests, but discussing what is assumed in them (one-sided/two-sided,
pooled data for null, and so on). You definitely can't assume away a thumb on
the scale.

Also this calculation is not compensating for any multiple trial or early
stopping effect. It (rightly or wrongly) assumes this is the only experiment
run and it was stopped without looking at the rates.

This may look like a lot of code, but the code doesn't change over different
data.

~~~
davnola
What do you mean by "brand name tests"?

------
thoughtpalette
I was looking for a much more personal article from the headline.

------
hvass
I would be curious to know what percentage of teams with statisticians / data
people actually use tools like Optimizely? A lot of people seem to be building
their own frameworks that use a lot of different algorithms (two-armed
bandits, etc.). From my understanding, Optimizely is really aimed at marketers
without much statistical knowledge.

Of course, if you're a startup, building an A/B testing tool is your last
priority, so you would use an existing solution.

Are there much more advanced 'out-of-the-box' tools for testing out there
besides the usual suspects, i.e. Optimizely, Monetate, VWO, etc.?

------
kareemm
This title used to read "How Optimizely (Almost) Got Me Fired", which is the
actual title of the article.

It seems a mod (?) changed it to "Winning A/B results were not translating
into improved user acquisition".

I've seen a descriptive title left by the submitter changed back to the less
descriptive original by a mod. But I'm curious why a mod would editorialize
certain titles and change them away from their original, but undo the
editorializing of others and change them to the less descriptive originals.

~~~
dshacker
I feel that the second title is better, as it talks about the kind of testing
they are using, instead of being clickbait along the lines of "HOW DID IT GET YOU FIRED?".

~~~
kareemm
My question is why mods change some headlines away from the originals to be
more descriptive (good) and why they change back to the originals even though
they are less descriptive (bad).

FWIW the change to this headline seems like the right decision to me.

~~~
dang
The guideline is to use the original title _unless it is misleading or
linkbait_ [1]. It's astonishing how often that qualifier gets dropped from
these discussions. It's pretty critical, and makes the reason for most title
changes pretty obvious.

1. [https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

~~~
kareemm
Thanks for the response. I'd humbly submit that there are occasions where the
guidelines should be ignored in service of a more descriptive (non-linkbaity)
title.

I can't find the submission but one recent example that comes to mind is a
presentation on radar detectors that was fascinating. I clicked because the
submitter described the article; the original title was (IIRC) the model
number of the radar gun.

Later a mod changed the HN post back to the model number, which had zero
relevance to anybody not in the radar gun industry.

------
tieTYT
> The kicker with one-tailed tests is that they only measure – to continue
> with the example above – whether the new drug is better than the old one.
> They don’t measure whether the new drug is the same as the old drug, or if
> the old drug is actually better than the new one. _They only look for
> indications that the new drug is better..._

I don't understand this paragraph. They only look for indications that the
drug is better... than what?

------
dk8996
Do any of these tools show you a distribution of the variable you're trying to
optimize? I am just thinking that some product features might be polarizing,
but if you measure the mean, it might give you different results than
expected. I am thinking that's where the two-tailed test comes in.

------
hawkice
Perhaps the most troubling element is that Optimizely seems comfortable
claiming 100% certainty in anything. That requires (in Bayesian terminology)
infinite evidence, or equivalently (in frequentist terminology) if they have
finite data, an infinite gap between mean performances.
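
A small illustration of that point (my own sketch, nothing to do with
Optimizely's internals): with Beta posteriors on two conversion rates, the
probability that one beats the other approaches 100% but never reaches it on
finite data.

    import numpy as np

    rng = np.random.default_rng(2)

    # Hypothetical finite data: conversions / visitors for two variations.
    conv_a, n_a = 200, 5000
    conv_b, n_b = 260, 5000

    # Beta(1, 1) prior + binomial likelihood gives a Beta posterior on each rate.
    samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, 200_000)
    samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, 200_000)

    # Monte Carlo estimate of P(rate_B > rate_A): close to 1, never exactly 1.
    print((samples_b > samples_a).mean())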

------
dmourati
Peculiar use of the word bug in this context:

"They make it easy to catch the A/B testing bug..."

~~~
rrrx3
meaning "fever" \- generally cured by more cowbell, but in this case only
"curable" by more A/B testing

------
dsugarman
this is all fine and good, but if your goal is to see what works best
between X new versions of a page and you are rigorous in creating variants,
Optimizely is a great tool for figuring out the best converting variant.

~~~
pdpi
Except, apparently, they aren't actually that good at _that_. If an A/A test
can yield a 100% chance of an 18% uplift, what gives you any degree of
certainty that other tests won't have equally skewed results?

~~~
vitamen
Run an A/A/B (or A/A/B/B) test, decide on traffic levels before you start the
test, and let it run until you reach those levels before you peek.

------
fvdessen
In my experience Optimizely does everything they can to mislead their users
into overestimating their gains.

Optimizely is best suited to creating exciting graphs and numbers that will
impress the management, which I guess is a more lucrative business than
providing real insight.

------
claar
The headline isn't really what this article is about, particularly the
disparaging of Optimizely. Might I suggest "The dangers of naive A/B testing"
or "Buyer beware -- A/B methodologies dissected" or "Don't Blindly Trust A/B
Test Results".

------
michaelhoffman
Where's the part where he "(almost)" got fired?

~~~
markolschesky
Maybe that's the headline that did best in an A/B test.

