
Why Wikipedia's A/B testing is wrong - synference
http://synference.blogspot.com/2014/01/why-wikipedias-highly-successful-ab.html
======
EvanMiller
I won't comment on the virtues of bandit algorithms, but the author grossly
mischaracterizes the requirements of a multivariate A/B test. Just do a
regular 50/50 split and run a regression on the outcome using the observable
characteristics as explanatory variables. Then take that model and feed the
observables back in to get an individualized prediction as to whether A or B
will perform better. You can bin continuous variables or you can include
higher-order terms to achieve a polynomial fit. The data requirements really
aren't that large.

I believe the author's confusion stems from attempting to model every
interaction term, e.g. if you think there will be some extra effect to being
Male _and_ using Windows _and_ being Swedish that is more than the "Male
effect" and "Windows effect" and the "Sweden effect" considered individually.

I wrote a little article on implementing a simple multivariate A/B test from
scratch:

[http://www.evanmiller.org/linear-regression-for-fun-and-
prof...](http://www.evanmiller.org/linear-regression-for-fun-and-profit.html)

All you really need is a function that inverts a matrix and you're good to go.
The cool part is that you can actually throw away most of the data that comes
in from the firehose: just keep updating the Hessian until you're ready to
invert it and, bam, you have a model.
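
Here's a minimal sketch of that streaming idea (illustrative only, not the
code from the linked article; it assumes a plain least-squares model with a
single observable and a variant dummy, and the feature layout is made up):

    import numpy as np

    N_FEATURES = 4  # intercept, variant dummy (0=A, 1=B), one observable, interaction

    XtX = np.zeros((N_FEATURES, N_FEATURES))  # the "Hessian" of the least-squares fit
    Xty = np.zeros(N_FEATURES)

    def featurize(variant, observable):
        # Design row: intercept, variant, observable, variant*observable.
        return np.array([1.0, variant, observable, variant * observable])

    def update(variant, observable, outcome):
        # Called once per visitor; the raw event can be discarded afterwards.
        global XtX, Xty
        x = featurize(variant, observable)
        XtX += np.outer(x, x)
        Xty += x * outcome

    def fit():
        # Solve the normal equations (X'X) beta = X'y for the coefficients.
        return np.linalg.solve(XtX, Xty)

    def better_variant(beta, observable):
        # Predict which variant has the higher expected outcome for this user.
        pred_a = beta @ featurize(0, observable)
        pred_b = beta @ featurize(1, observable)
        return "B" if pred_b > pred_a else "A"

Because the design row includes the interaction term, the fitted model gives an
individualized prediction of which variant should do better for a given user.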

~~~
feral
(Synference cofounder here)

That's a good blog post.

There are many ways to handle multi-variate data, with different tradeoffs.
The linear model you describe in your post is one of the classic machine
learning models often used. In the post, we address the shortcoming of such
models in the subsection " _Trading off exploration vs. Exploitation._ "

As we point out, such a model ignores the trade-off between exploration and
exploitation. That means that you have to gather all your data by randomly
serving variants, before you can figure out the right variant to serve to each
user.

That's very expensive in terms of serving-the-wrong-variant, if what you are
trying to predict is conversion or revenue. By contrast, our method will
quickly realise if a particular variant has a very high expected value for the
current user (with the current user's particular combination of features), and
so serve that user that variant - even if we aren't confident about other
users.

So, like any bandit algorithm, we trade off exploration and exploitation; but
further, we trade this off not just across the population as a whole, but on a
personalised basis. That's a very important functionality that the approach
you describe doesn't have.
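
To give a rough flavour of what per-user exploration vs. exploitation looks
like in code, here is a toy linear Thompson-sampling sketch - illustrative
only, not our actual algorithm; the Gaussian prior and noise model are just
assumptions for the example:

    import numpy as np

    class LinTSArm:
        # One variant: a Bayesian linear model of reward given user features.
        def __init__(self, dim, prior_var=1.0, noise_var=1.0):
            self.A = np.eye(dim) / prior_var   # posterior precision matrix
            self.b = np.zeros(dim)             # precision-weighted mean
            self.noise_var = noise_var

        def sample_reward(self, x):
            # Draw one plausible expected reward for context x from the posterior.
            cov = np.linalg.inv(self.A)
            theta = np.random.multivariate_normal(cov @ self.b, cov)
            return theta @ x

        def update(self, x, reward):
            self.A += np.outer(x, x) / self.noise_var
            self.b += x * reward / self.noise_var

    def choose_variant(arms, x):
        # Each arm is judged for *this* user's feature vector x. Arms the model
        # is uncertain about sometimes win (exploration); arms that are clearly
        # best for this context win almost always (exploitation).
        return int(np.argmax([arm.sample_reward(x) for arm in arms]))

After each conversion event you update only the arm that was served, so the
uncertainty - and hence the exploration - is specific to each user's particular
combination of features.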

Why is this an issue? Because if you have to gather data by randomly sampling,
what if a certain sort of user doesn't appear very often? (e.g. let's say, for a
particular site, that's a 65-year-old female from Canada). Do you randomly
sample until you see a user like this? If so, you are randomly sampling for _a
long time_. Or do you stop your exploration before you see rare users? This
totally neglects your rarer users. What if rare users have unusual
preferences, or are particularly valuable? (E.g. maybe your rare users are
your biggest spenders - which happens.)

And, if you have high dimensional data, all your users are effectively rare
(because you rarely see users with that _exact_ combination of features).

Furthermore, the 'gather data randomly' approach is also impractical if user
behaviour is dynamic, as it often is in real-world settings (i.e. the
preferences of your user base as a whole change - maybe they've all recently
read an article, so different marketing copy now resonates with them).

Finally, sometimes those interactions you mention, between different features,
are really important, and shouldn't be neglected.

I could go on - there are a lot of subtle issues that the approach we are using
handles that simple multivariate testing doesn't, basically.

What we're doing is trying to provide an API that wraps all this knowledge up
into a service that's easy for people to use.

------
adwf
"Highly successful" and "All wrong"? There's an oxymoron if ever I've heard
one.

~~~
feral
(Synference cofounder here)

It has successfully increased Wikipedia's donation revenue; however, some
fundamental assumptions baked into the AB-testing approach are all wrong, and
that leaves a lot of money on the table.

Instead of trying to find a single best version for everyone, people should
realise that different segments of their users are going to have different
preferences. Machine learning can find these in an automated way, and there's
a lot of value to be gained - but people need to see past the standard 'one
size fits all' AB-testing approach to take advantage of it.

~~~
ilyanep
There's a huge difference between "all wrong" and "has some fundamental
assumptions that could lead to non-optimal results if you wanted to push the
results even further". The former implies, upon initial reading of the
headline, that whatever Wikipedia is doing doesn't even give you a better
solution.

A better headline might have been "Hidden assumptions in Wikipedia's A/B
testing and how it could be improved". Also, why are we picking on Wikipedia
here? Don't a lot of companies claim to do A/B testing and isn't this a
problem inherent in all simple A/B testing?

~~~
feral
AB-testing is much better than doing nothing.

But you have to be careful with any powerful tool, in case its success blinds
you to its weaknesses. "When all you have is a hammer, everything starts to
look like a nail". We think that's happening with AB-testing at the moment.

We think Wikipedia is awesome, and would love them to get more donations by
using a more sophisticated approach.

Yes, most companies doing AB-testing, if they have the ability to personalise
their user experience (i.e. they aren't trying to quickly find the single best
UI) could benefit.

However, Wikipedia is a really good example to start with - take the phrase
'Wikipedia needs those nickels': it resonates well with US donors, and will
probably work in Canada, but what about the UK? Australia?

It's obvious once it's pointed out - but wouldn't it be better if the system
automatically realised this? And considered all the combinations? That's our
point.

------
noelwelsh
I'm a big proponent of bandit algorithms (I'm a founder of Myna
[http://mynaweb.com/](http://mynaweb.com/), something of a competitor to
Synference). I think there is a bit of nuance here that is missing in the
discussion.

Should Wikipedia use a contextual bandit system? Assuming the cost is
equivalent, I say yes, for the following reasons:

- they have a shedload of traffic, and then some; and

- they have a limited time in which the campaign is running.

Let's address these in turn.

The regret bounds on contextual bandits are O(d sqrt(T)), where d is the
dimensionality of the user profile you're using and T is time (see e.g.
[http://arxiv.org/abs/1209.3352](http://arxiv.org/abs/1209.3352)). In
contrast, the standard stochastic bandit has regret bounds of O(log(T)), which
is much tighter. The summary is you need a lot more data to explore the bigger
range of possibilities you've created. So I'm not sure contextual bandits are
viable for smaller sites, but Wikipedia definitely has the traffic to make
them work.
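
For a rough sense of how those two bounds grow (constants and log factors
dropped, and d picked arbitrarily, so only the shapes are meaningful):

    import math

    d = 20  # illustrative user-profile dimensionality
    for T in (10_000, 1_000_000, 100_000_000):
        print(f"T={T:>11,}  d*sqrt(T)={d * math.sqrt(T):>12,.0f}  log(T)={math.log(T):.1f}")

Even at very large T the d*sqrt(T) term dwarfs log(T), which is why the extra
traffic matters so much for contextual bandits.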

If your test is running for a limited time you can just delete the code when
you're done. This has several advantages. People running A/B tests typically
want to answer the question "what is best?" (To crush your enemies, obviously,
but you can't usually say that) under the assumption that you'll then display
the best variant forever. Contextual bandit algorithms don't answer this
question; instead they minimise regret (equivalently, making you the most
money) which makes more sense in the limited time setting. In fact there is no
"best" with a contextual bandit -- the best variant depends on the user
profile. This means you need to run the algorithm forever: good if you're
collecting monthly subscriptions based on algorithm use, but bad if you're
trying to manage a site with an exploding number of variants under test that
you can never delete.

Ob plug: If you've found this post interesting you should consider signing up
for the bandit course I'm running: [http://noelwelsh.com/data/2013/08/16/free-
bandit-course-draf...](http://noelwelsh.com/data/2013/08/16/free-bandit-
course-draft-outline/) It will cover contextual bandits, and a whole heap of
other interesting stuff.

~~~
feral
>The regret bounds on contextual bandits are O(d sqrt(T)), where d is the
dimensionality of the user profile you're using and T is time (see e.g.
[http://arxiv.org/abs/1209.3352](http://arxiv.org/abs/1209.3352)). In
contrast, the standard stochastic bandit has regret bounds of O(log(T)), which
is much tighter. The summary is you need a lot more data to explore the bigger
range of possibilities you've created. So I'm not sure contextual bandits are
viable for smaller sites, but Wikipedia definitely has the traffic to make
them work.

It's definitely viable for smaller sites. Your argument there is about the
bounds in a particular theoretical situation ("when the contexts are provided
by an adaptive adversary.")

In a real world situation, your contexts are provided by the normal behaviour
of your users, not an adversary. And, rather than the patterns being maximally
hard to find, there'll instead be a lot of low-hanging-fruit - in terms of
very obvious combinations of features that define different segments of users.

What's important is not the bound of the contextual bandit in a theoretical
situation, but rather what performance it gives in 90% of business situations.
In many situations, doing some personalisation to automatically find the most
obvious user groups is a huge win.

And later, as the amount of data increases, the granularity of segmentation
will effectively increase.

I agree with your general comments on bandits, and love what Myna is doing.

------
wmt
Implementing more complex machine learning methods isn't free, and neither is
buying one, so the system should bring in dramatically more money just to cover
its costs.

The company whose ad this post is isn't showing much data about the difference
between the solution it's selling and A/B testing. I wonder if there isn't any,
or if it doesn't support its price?

~~~
feral
I'm a cofounder of the early 2-person startup whose blog post this is.

Yes, it's currently difficult and expensive to bring complex ML to bear on
this problem; we are trying to make that easy and cheap. The reason people
adopted AB-testing was that it has a positive impact on the metrics they care
about; we think the workflow can be easier and more valuable than AB testing.
If it's easy to use, the extra value should cover the cost.

We're currently focused on trying to communicate that this approach can create
value - this blog post, written for a technical audience, is to give some
intuition around how the approach works - does that make sense?

We genuinely believe in making it easy to use ML optimisation in this context
- that's why we're working on this.

------
thatthatis
It's a minor point, and perhaps only a quibble, but the title is wrong.

Wikipedia's AB testing is correct insofar as the problem and approach have
been defined. This article suggests an improvement.

Wrong is when they're trying to test for something but due to some error they
aren't actually testing for what they think they are.

Something can be accurate/correct but also improvable.

~~~
davidgerard
That's because this post exists as attempted marketing for Synference.

~~~
feral
Yes, it's marketing, at least in the broad sense of the word - we think people
should use contextual bandits more, and we want people to know that.

That's why we're working on this problem.

As I said elsewhere, we think AB-testing is a big improvement over nothing,
but, like any powerful tool, it can blind people to its downsides.

Our blog title is designed to point out, succinctly, that we think the vanilla
AB-testing framework is wrong for Wikipedia, as an easy-to-understand example.
We believe that.

I notice that you volunteer for Wikipedia; I want to be clear, we're not
trying to pick on Wikipedia here; I personally love Wikipedia. It's just a
good, well-known example of a successful AB-testing campaign that we think
could be much improved.

------
thrush
It's awesome to see a Bandit-based algorithm being used. I think this is
really similar to what is called Adaptive Sampling (also based on the Bandit
problem), and it's been used in multiple contexts. My favorite has been in
clinical trials where adaptive sampling was able to save lives.

Read more about the research done in Clinical Trials, and one of my favorite
Professors here:
[http://web.eecs.umich.edu/~qstout/AdaptiveDesign.html](http://web.eecs.umich.edu/~qstout/AdaptiveDesign.html)

------
CGamesPlay
I found it was useful to brush up on how the multi-armed bandit problem works
in order to understand this article. [http://en.wikipedia.org/wiki/Multi-
armed_bandit](http://en.wikipedia.org/wiki/Multi-armed_bandit)
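
If it helps, here's a minimal sketch of one textbook approach to the classic
(non-contextual) problem - beta-Bernoulli Thompson sampling. It's purely
illustrative and not necessarily the algorithm the article has in mind:

    import random

    class Arm:
        def __init__(self):
            self.successes = 0  # e.g. conversions seen for this variant
            self.failures = 0   # e.g. non-conversions

    def choose(arms):
        # Sample each arm's conversion rate from its Beta posterior; play the max.
        samples = [random.betavariate(a.successes + 1, a.failures + 1) for a in arms]
        return samples.index(max(samples))

    def record(arm, converted):
        if converted:
            arm.successes += 1
        else:
            arm.failures += 1

Arms that have performed well get picked more often, while rarely-tried arms
still get sampled occasionally because their posteriors are wide.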

------
zackbloom
This is a little silly. Of course every form would convert better if you
perfectly customized it to the current user, no one really disagrees with
that. But it doesn't mean it's always practical or worth the effort, or that
the Wikipedia team have the time to tackle that at this moment.

~~~
conradlee
Hey there, I wrote that post. It's a valid criticism you make, but the main
point of the article was _not_ to pick on Wikipedia, which is actually way
ahead of most other organizations in terms of testing its content.

We made Wikipedia a case study because their banners are a well-known example
of A/B testing. Plus, because their banners make tens of millions of dollars
each year, and they have an entire team of people working on the A/B tests
over a period of months, I think they have the capacity to use something more
sophisticated. Maybe, like you suggest, they're just focused on other aspects
of the fundraising campaign. I guess the only way to know would be to hear
from some of the Wikipedia team!

------
clarle
Really interesting concept, but what other information can you grab from a
user who has never visited your site before?

There's the example of location, and I guess you can also get browser
information (mobile device vs. desktop), but other than that, I feel there
isn't that much in terms of context to work with for first-time anonymous
users, which is the most common case for A/B testing.

Would love to hear how else you're classifying users, and what other buckets
you're using.

~~~
michaelt
You could divide articles into various categories.

Maybe people reading computer science articles prefer light blue, while people
reading medicine articles prefer gold.

Of course, a quick look at Wikipedia's categories shows there are many possible
article categorisation schemes (are sneakers footwear like stilettos or sports
equipment like cricket bats? Of course they're both), so this would introduce
many additional variables - more than you could exhaustively A/B test.
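
For illustration, categories like these would typically enter a model as
context features rather than as separate test cells (the category list below
is made up):

    CATEGORIES = ["computer_science", "medicine", "sport", "footwear"]  # made up

    def category_features(article_categories):
        # Multi-hot encoding; an article can belong to several categories at once.
        cats = set(article_categories)
        return [1.0 if c in cats else 0.0 for c in CATEGORIES]

    print(category_features(["footwear", "sport"]))  # -> [0.0, 0.0, 1.0, 1.0]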

------
mcfunley
You're disregarding the necessity of a "small non-profit" not wasting time
running broken tests, which is not necessarily optimized by complicating the
bejeezus out of everything.

------
Houshalter
This is just advertising for their company; how does this algorithm actually
work?

------
filbertkm
Stopped reading at "much of it is organized by volunteers", as that's not
really true anymore.

I'm sure the A/B testing can be improved, but I'm not at all convinced that how
it's done now is "wrong".

