
Optimizely Statistics Engine - kylerush
https://www.optimizely.com/statistics
======
darkxanthos
Every time they say "classic statistics" just insert "what we did before now"
and see how frustrated you get with this announcement. The whole point of
using them is that people don't need a statistician because the tool should
make it easy to run solid tests. That of course hasn't been the case and
they're finally admitting it.

~~~
dargani123
Thanks for your comment. This is Darwish, the Product Manager working on Stats
Engine. You are correct: "classic statistics" is the method we used in the
past. It is also the method most commonly used in the industry (the main
reason we started with it). This was not an easy project for us to take on,
but after talking to customers and looking at our historical experiment data,
it was clear how important this problem was to solve, and that's why we spent
a lot of resources on fixing it.

For those following along on this comment, it's not that "classic statistics"
on their own are incorrect, but rather that the misuse of these statistics can
be costly. When used "incorrectly" (not using a sample size calculator,
running many goals and variations at a time, etc.), you can meaningfully
increase your chance of making a bad business decision or commit yourself to
unnecessarily long sample sizes.

Using statistics correctly is an industry-wide problem that many have tried to
solve with education (i.e. giving statistics crash courses). We hope that our
solution shows how important we think it is that statistics drive day-to-day
decisions in organizations, and that there are different ways (change the
math, not the customer) to get customers to this point. Many companies have
data science teams and in-house statisticians that are very aware of these
problems, but many don't, and that's really where we wanted to help out. You
can read more about why we thought this was a serious problem here:
[http://blog.optimizely.com/2015/01/20/statistics-for-the-
int...](http://blog.optimizely.com/2015/01/20/statistics-for-the-internet-age-
the-story-behind-optimizelys-new-stats-engine/)
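
The misuse described above (checking a running test continuously without a
fixed sample size) can be quantified with a short simulation. This is a
minimal sketch, not Optimizely's code; the traffic, conversion rate, and peek
interval are illustrative:

```python
import math
import random

def z_test_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a classical two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def peeked_aa_test(n_max=5000, peek_every=100, rate=0.10, alpha=0.05):
    """A/A test (identical variations, so no true difference); return True if
    a 'peeker' who re-runs the z-test every peek_every visitors would ever
    declare significance."""
    conv_a = conv_b = 0
    for n in range(1, n_max + 1):
        conv_a += random.random() < rate
        conv_b += random.random() < rate
        if n % peek_every == 0 and z_test_p(conv_a, n, conv_b, n) < alpha:
            return True  # a peeker would stop here and declare a winner
    return False

random.seed(0)
trials = 200
false_positives = sum(peeked_aa_test() for _ in range(trials))
print(f"A/A tests ever declared significant: {false_positives / trials:.0%}")
```

With continuous peeking the share of no-difference tests that ever look
significant lands several times above the nominal 5%, which is the "chance of
making a bad business decision" the comment refers to.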

~~~
thinkmoore
What's so particularly embarrassing is that you clearly did not have any
competent statisticians on board until now. This was not some big surprise
that needed "a lot of resources to fix." This is something that should be
obvious to anyone who understands hypothesis testing, and is something that
statisticians have been describing how to do correctly for over 50 years:
[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1551774/](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1551774/)

~~~
joecasson
By that logic, no one else in the industry (Adobe, Mixpanel, VWO) has
competent statisticians. That's silly.

~~~
thinkmoore
Apparently that might be the case. I mean the problem with multiple hypothesis
testing is not exactly rocket science...

------
aaronjg
I wrote about the problem with sequential testing in online experiments three
years ago on the Custora blog [1]. And Evan Miller wrote about it two years
before me on his blog [2]. I'm glad to see Optimizely finally getting on
board. Communicating statistical significance to marketers is always
challenging, and I'm sure this will lead to better decisions being made.

[1] [http://blog.custora.com/2012/05/a-bayesian-approach-to-ab-
te...](http://blog.custora.com/2012/05/a-bayesian-approach-to-ab-testing/)

[2] [http://www.evanmiller.org/how-not-to-run-an-ab-
test.html](http://www.evanmiller.org/how-not-to-run-an-ab-test.html)

~~~
dargani123
Hey guys,

Addressing a few comments right here. I think the industry deserves
a lot of credit in its efforts to help those wanting to run A/B tests. Many
people were aware these were issues and many actually tried to fix it (us
included). There are many blog posts in the community about why continuous
monitoring is dangerous, why you should use a sample size calculator, how to
properly set a Minimum Detectable Effect, etc. We were part of this group
(and definitely not the first), as we published a sample size calculator and
spent a lot of time working with our clients on running tests with a safe
testing procedure.

However, after doing this and attempting to quantify the effect of these
efforts more closely, we saw an opportunity for a simpler solution that
could help even more people. Sequential Testing was this solution, and it's
had success in other applications. We wanted to bring sequential testing to
A/B testing and take the hard work out of doing it correctly. Specifically, we
have built on the groundwork laid in the '50s and '60s by providing the
always-valid notion of a p-value that customers are looking for.

While traditional sequential tests combat the continuous monitoring problem
well, they require an intimate understanding of the method, which can pose
cognitive hurdles for those not well-versed in statistics. You have
to either know your target effect size, or have in mind a maximum allowable
number of visitors and understand how changes in these will affect the run
time of your test. What’s more, it is not straightforward to translate results
to standard measures of significance such as p-values. This is actually where
the biggest research contribution of Stats Engine comes in. We allow you to
run a test, detect a range of effect sizes and provide an always valid FDR-
adjusted p-value as opposed to a set of stopping rules that bounds Type 1
error at, say, 5%. The error rates are valid no matter how the user chooses
to interact with the A/B test. Also, FDR control itself has only been around
for the last 20-25 years.

Our biggest industry contribution is probably much simpler: moving a lot of
the market to sequential testing more generally. We are happy to be in the
position to help build on research and bring this to practical applications.
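
The "always valid" p-value mentioned above can be sketched with a
Robbins-style mixture likelihood ratio. This follows the general mSPRT
construction from that mid-century groundwork, not Optimizely's exact
implementation; the observation variance σ² and the mixing variance τ² are
assumed known here, and the data are made up:

```python
import math
import random

def msprt_p_values(xs, sigma2=1.0, tau2=1.0):
    """Always-valid p-values for H0: mean = 0 via a mixture sequential
    probability ratio test. Observations are assumed ~ N(mean, sigma2),
    with a N(0, tau2) mixing prior over the unknown mean."""
    p = 1.0
    s = 0.0  # running sum of observations
    out = []
    for n, x in enumerate(xs, start=1):
        s += x
        # closed-form mixture likelihood ratio against H0
        lam = math.sqrt(sigma2 / (sigma2 + n * tau2)) * math.exp(
            tau2 * s * s / (2 * sigma2 * (sigma2 + n * tau2)))
        # taking the running minimum keeps the p-value monotone non-increasing,
        # so it stays valid no matter when (or how often) you look
        p = min(p, 1.0 / lam)
        out.append(p)
    return out

random.seed(1)
null_data = [random.gauss(0.0, 1.0) for _ in range(1000)]  # no true effect
shifted = [random.gauss(0.3, 1.0) for _ in range(1000)]    # true effect of 0.3
print("final p under H0:   ", round(msprt_p_values(null_data)[-1], 3))
print("final p with effect:", round(msprt_p_values(shifted)[-1], 8))
```

With a real effect the p-value drops and, once low, can only stay low, which
matches the "accumulated evidence" behavior discussed elsewhere in this
thread; under the null it rarely ever crosses the threshold despite continuous
monitoring.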

------
nsaje
What a misleading comparison:
[http://i.imgur.com/aWqxk2U.png](http://i.imgur.com/aWqxk2U.png)

"95% chance you'll make the _right_ decision" vs "30% chance you'll make the
_wrong_ decision", emphasis mine.

------
debacle
This graph is a joke: [https://d1qmdf3vop2l07.cloudfront.net/optimizely-
marketer-as...](https://d1qmdf3vop2l07.cloudfront.net/optimizely-marketer-
assets.cloudvent.net/raw/statistics/sequential-testing-graphic-us.png)

And any company trying to sell a statistical tool/package that would actually
create a graph like that is selling snake oil. Your model only gets better,
digitally, and never sees a regression? And you're using this for _web
analytics?_

~~~
leo_pekelis
Hello, Leo, Optimizely's in-house statistician here. The graph you reference
is a schematic to show the differences between Optimizely’s previous
statistical platform and Stats Engine. It shows a monotone non-decreasing
significance because under our sequential testing framework, the significance
value represents the total amount of accumulated evidence against the null
hypothesis of no difference between a variation and baseline. This wealth of
evidence cannot decrease because you can only get more information about your
test as you get more visitors. Of course, it is very possible, as the graph
shows, that you will not acquire enough contradictory evidence to reach
significance in a reasonable number of visitors. If we instead looked
continuously at a classical t-test, the significance would oscillate near the
significance threshold. Spurious deviations would
cause multiple, contradictory declarations that the test is significant and
then not. A savvy A/B tester might wait until the oscillations die down.
Sequential testing is a principled, mathematical way to differentiate evidence
against the null hypothesis from random oscillations in real time. It should
be noted that the chance of a type I error is still controlled at 5%.

You do make a good point that sometimes an A/B test will see regression over
time. We have explicitly separated this out because we feel detecting a change
in the underlying effect size is different from testing whether the effect is
non-zero, and different statistical methods are better suited to one over the
other. We have built a policy into our framework that monitors for such
temporal effects and signals an A/B test is in a ‘reset’ when we discover
them. In our historical database, this happened on about 4% of tests.

I concede all this is a lot to get across in one graph, but we do feel that it
is a good representation of how significance behaves under Stats Engine. If
you would like to read more about the math behind Stats Engine, here is a link
to a full technical article:
[http://pages.optimizely.com/rs/optimizely/images/stats_engin...](http://pages.optimizely.com/rs/optimizely/images/stats_engine_technical_paper.pdf)

~~~
dlss
> It shows a monotone non-decreasing significance because [the value]
> represents the total amount of accumulated evidence against the null
> hypothesis.

> if we instead looked continuously at a classical t-test, the significance
> would oscillate near the significance threshold

So there's your answer: the y-axis on the chart has an unlabeled different
meaning for the blue line.

While I have you here Leo, can you explain why you would want to chart only
the accumulated evidence for X? It's meaningless without knowing how much
evidence has been accumulated for not X.

~~~
leo_pekelis
One point of clarification, the y-axis on the chart does have the same meaning
for both lines. It is 1 minus the chance of committing a type I error. I think
you do point out an important nuance that under sequential testing a type I
error changes to “ever detecting a significant result on an insignificant
test” instead of just at one, predetermined visitor count.

The amount of accumulated evidence for X is exactly a p-value, or a
measurement which can tell you if there is enough evidence in the experiment
to contradict a hypothesis of “no difference between a baseline and
variation.” A high p-value, or low significance, tells you there is a lack of
evidence to make this claim.

You bring up a very interesting point which is that with sequential testing it
is actually possible to also look for evidence of ‘not X’ or that there really
is no detectable difference. This works by ‘flipping the hypothesis test on
its head’ and allows for a mathematical formulation of stopping early for
futility. We do not currently offer this in Stats Engine because we believe
it’s the less important quantity of the two, but it may be the focus of future
research.

------
bartkappenburg
The claimed math behind the 'new engine':
[http://pages.optimizely.com/rs/optimizely/images/stats_engin...](http://pages.optimizely.com/rs/optimizely/images/stats_engine_technical_paper.pdf)

------
calpaterson
I'm someone that would consider using Optimizely: no formal stats background
but understand high school stats, work on web apps professionally and
interested in analytics and testing. I've watched the video, read everything
on the page and I still don't understand what they're trying to tell me here.

Based on my admittedly limited understanding of stats, unless you set the
sample size and decide what significance is in advance your test will probably
misinform you. Nothing on this landing page explains to me how this new thing
might mean otherwise, and it really doesn't help that the page is otherwise
full of hubris, e.g. "goodbye traditional statistics". Somehow it seems
unlikely that a web startup just invalidated all of statistics.

~~~
leo_pekelis
Hey,

If you are interested in learning more about the problems we solved, we lay
them out in much more detail in our blog post. It’s a meaty topic and so the
post is not short. [http://blog.optimizely.com/2015/01/20/statistics-for-the-
int...](http://blog.optimizely.com/2015/01/20/statistics-for-the-internet-age-
the-story-behind-optimizelys-new-stats-engine/)

In a few sentences: in the past, if you didn’t use a sample size calculator
properly (set a sample size up front and only evaluate test results at that
time) and had tests with a lot of goals and variations, you could increase
the chance of making an incorrect declaration. With Stats Engine, we allow
you to monitor your results in real time and test as many hypotheses as you
would like, and we give an accurate representation of the likelihood your
test is actually a winning or losing test. We’re definitely not trying to
claim to have invalidated traditional statistics. If you used the proper
testing procedure in the past, then Stats Engine will simply give you an
easier workflow than before (no need to pick an appropriate sample size or
minimum detectable effect, or limit the number of hypotheses being tested).
Many companies have data scientists, statisticians, or are otherwise
well-informed on the topic; however, many are not. Stats Engine gives you an
accurate statistical significance measure without requiring you to set a
sample size, because it accounts for the errors that are introduced by
looking at your results as experiment data comes in.
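
The "proper testing procedure" mentioned here, picking a sample size up
front, is usually done with the standard fixed-horizon formula for two
proportions. This is the generic textbook calculation, not Optimizely's
calculator; the baseline rate and effect below are illustrative:

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, mde, alpha=0.05, power=0.80):
    """Classical fixed-horizon sample size for a two-proportion test.
    mde is the absolute minimum detectable effect (0.02 = 2 percentage
    points). Uses the common pooled-variance approximation 2 * p * (1 - p)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = 2 * baseline_rate * (1 - baseline_rate)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# e.g. a 10% baseline conversion rate, aiming to detect a 2-point absolute lift
print(sample_size_per_variation(0.10, 0.02))
```

Under classical statistics you would commit to this many visitors per
variation and evaluate results only once, at the end; halving the detectable
effect roughly quadruples the required sample.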

------
B-Scan
For those who don't want to enter personal details, PDF is on this URL:
[http://pages.optimizely.com/rs/optimizely/images/practical_g...](http://pages.optimizely.com/rs/optimizely/images/practical_guide_to_stats.pdf)

------
austincav
I am surprised by all the negative commentary here. On the whole, companies
like Optimizely, RJMetrics, Custora, and others are doing more to push
statistical analysis to the mass market than anyone else. These tools are not
designed for statisticians or ML practitioners so it makes sense they do not
put language like Bayesian, etc. front and center. IMO, the more people using
data to make decisions, the better.

~~~
ayy88
It's not that they don't put in 'language like Bayesian', it's a different
method. Yes, it is an improvement on the t-test straw-man they mention, but
it's less flexible and powerful than Bayesian methods. Once you have a
posterior, you can ask different questions that their p-values/confidence
intervals don't address. For example, probability of an x% increase in
conversion rate, or the risk associated with choosing an alternative. Not to
mention multi-armed bandits, which are not only expected to arrive at an
answer faster, but also maximize conversions along the way.
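
The kind of posterior question mentioned here can be sketched with a simple
Beta-Binomial model. This illustrates the general Bayesian approach, not any
vendor's implementation; the uniform Beta(1, 1) priors and the conversion
counts are made up:

```python
import random

def prob_relative_lift_exceeds(conv_a, n_a, conv_b, n_b,
                               lift=0.05, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > (1 + lift) * rate_A) under
    independent Beta(1, 1) priors on each arm's conversion rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        # posterior for each arm is Beta(successes + 1, failures + 1)
        rate_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        rate_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        hits += rate_b > (1 + lift) * rate_a
    return hits / draws

# e.g. 100/1000 conversions on A vs. 130/1000 on B:
# probability that B beats A by more than 5% relative
print(round(prob_relative_lift_exceeds(100, 1000, 130, 1000), 3))
```

A p-value or confidence interval does not directly answer "what is the chance
of at least an x% lift?", whereas the posterior lets you ask it for any
threshold.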

~~~
leo_pekelis
While I do agree that a sequential hypothesis test like the one we implemented
in Stats Engine is different from a completely Bayesian method, I wouldn’t
necessarily call it less powerful. In fact, numerous optimality properties
exist showing that a properly implemented sequential test minimizes the
expected number of visitors needed to correctly reject a null hypothesis of
zero difference. I should note that our particular implementation does use
some Bayesian ideas as well.

I agree that a benefit of Bayesian analysis is flexibility. Different
posterior results are possible with different priors. But in practice this can
be a hindrance as well as a benefit. When answers depend on the choice of
prior, misusing or misunderstanding the prior can lead to incorrect
conclusions.

There is also a very attractive feature of Frequentist guarantees specifically
for A/B testing. They make statements on the long-run average lift, which is a
quantity that many businesses care about: what will my average lift be if I
implement a variation after my A/B test?

That said, we have looked, and continue to look, at Bayesian methods because
we don’t feel we have to be in either a Frequentist or Bayesian framework, but
rather use the tools that are best suited to answer the sorts of statistical
questions our customers encounter.

Finally, there have been some very interesting results lately on the
connections between sequential testing and bandits! (for example, see here:
[http://auduno.github.io/SeGLiR/documentation/reference.html](http://auduno.github.io/SeGLiR/documentation/reference.html)
)

~~~
yummyfajitas
_They make statements on the long-run average lift, which is a quantity that
many businesses care about: what will my average lift be if I implement a
variation after my A/B test?_

Could you state clearly what this guarantee is? Unless I'm making a stupid
mistake, such guarantees are impossible even in principle with frequentist
statistics.

------
will_lam
Interesting that Optimizely is positioning itself under the overarching
banner of "Statistics reinvented for the internet age". My guess is that this
is to parry the onslaught of A/B testing and optimization platforms for web
and mobile from all directions. Of course, with their stable of PhD
statisticians and data scientists, Optimizely is the answer.

~~~
mziel
"Statistics reinvented for the internet age" looks like a cheap knock-off of
long-known Bayesian statistics.

~~~
darkxanthos
It's actually sequential testing, which is not necessarily Bayesian. Also,
it's not a cheap knock-off, just another way of getting to an answer.

~~~
mziel
Bayesian updating of the posterior? Or if you prefer frequentist algorithms
for online learning of classifiers?

Not trying to pick a fight; just, as a statistician/ML developer, I've seen
the same things be reinvented and renamed so many times.

~~~
jmalicki
No, if you read their technical paper, it's frequentist sequential testing
with false discovery rate control, which is a fairly recent development (I
mean, 25 years old is pretty new in statistics).

[http://pages.optimizely.com/rs/optimizely/images/stats_engin...](http://pages.optimizely.com/rs/optimizely/images/stats_engine_technical_paper.pdf)

~~~
dlss
I think all OP is trying to point out is that it either agrees with Bayesian
methods or _it's wrong_... so at best it's not materially new, and at worst
it's using questionable assumptions.

~~~
GFK_of_xmaspast
"it either agrees with Bayesian methods or it's wrong"

This kind of faith-based statistics is pure ideology.

------
brownesauce
Fantastic to see Optimizely changing their stats model. The more education
that is done on web experimentation, the better, as there is certainly still
a lot of snake oil being sold out there!

Their chosen technique is one way of solving the problem of communicating
statistics to non-technical audiences; however, the interpretation of the
results may suffer here. I can imagine that this technique will lead to
overestimations of the effect size in situations where the threshold is
reached early in an experiment, as it will reward extreme values observed
while the experiment is underpowered.
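
This overestimation effect, sometimes called the winner's curse, is easy to
demonstrate with a simulation of a naively peeked fixed-horizon test. All
parameters below are illustrative and this models the naive procedure, not
Stats Engine:

```python
import math
import random

def estimate_at_stop(true_lift=0.01, base=0.10, n_max=4000, peek=200,
                     rng=None):
    """Run one naively peeked A/B test; return the estimated lift at the
    moment a classical z-test first crosses 5% significance, or None if it
    never does within n_max visitors per arm."""
    rng = rng or random
    conv_a = conv_b = 0
    for n in range(1, n_max + 1):
        conv_a += rng.random() < base
        conv_b += rng.random() < base + true_lift
        if n % peek == 0:
            rate_a, rate_b = conv_a / n, conv_b / n
            pooled = (conv_a + conv_b) / (2 * n)
            se = math.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and abs(rate_b - rate_a) / se > 1.96:
                # record the estimate exactly when we stopped early
                return rate_b - rate_a
    return None

rng = random.Random(42)
stopped = [e for e in (estimate_at_stop(rng=rng) for _ in range(200))
           if e is not None]
avg = sum(stopped) / len(stopped)
print(f"true lift: 1.0 points; "
      f"average estimated lift at early stop: {avg * 100:.1f} points")
```

Conditioning on an early significant stop selects for extreme observed
differences, so the average estimate at the stopping time exceeds the true
lift; this is the motivation for the confidence-interval approach described
in the reply below.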

~~~
leo_pekelis2
Glad you like that we’re changing things up!

You do bring up a good point. Even though a sequential test can be called
much earlier than a fixed-horizon test (note this only happens when the
effect size is large enough to still guarantee Type I error control), that
does not change the fact that estimates of the effect size are more variable
when there are fewer visitors. The way we are addressing this is to make
confidence intervals more prevalent in our platform. The widths of these
confidence intervals represent our uncertainty in the magnitude of the true
effect size given the information currently available. They correctly get
narrower as the experiment goes on, as there is increased information from
more visitors.

------
Throwaway1224
A lot of people are denigrating the site based on its content, but I'll,
fully expecting to be down-voted, go out on a limb and say what I think we
are all thinking: "I just don't like that guy's sweater".

------
triz
This is a huge step in the right direction.

------
mkirlin
Subject-verb agreement for the internet age will have to wait for a while.

~~~
madcaptenor
Perhaps the word "statistics" is singular, like "mathematics".

~~~
mgkimsal
It's something I've heard both ways. Whether it's correct in both cases ...
I'm not sure. "I end up feeling like a statistic" is a singular form, for
sure, but I often hear/read "statistics" referenced singularly as well.

~~~
Bahamut
Statistics refers to the subject (usually), while a statistic has a specific
meaning within statistics.

------
normloman
Offtopic: Who else is tired of companies putting LY at the end of their name?

------
goodmachine
'are here'

~~~
michaelmior
I believe they're referring to the field of statistics as a unit instead of a
collection of multiple statistics.

~~~
goodmachine
Yes, but it reads badly.

~~~
mattwad
"Statistics class for the internet age is here".. points out that people don't
have time to waste to learn "classic statistics" :)

