
20 lines of code that beat A/B testing (2012) - _kush
http://stevehanov.ca/blog/index.php?id=132
======
orasis
Here's what everyone is missing. Don't use bandits to A/B test UI elements,
use them to optimize your content / mobile game levels.

My app, 7 Second Meditation, has a solid 5 stars and 100+ reviews because I use
bandits to optimize my content.

By having the system automatically separate the wheat from the chaff, I am
free to just spew out content regardless of its quality. This allows me to let
go of perfectionism and just create.
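
Roughly the idea, as a minimal sketch (this is one common bandit policy,
Thompson sampling over per-item Beta posteriors, not my exact code; the item
names and "liked" feedback signal are made up):

    import random

    # one Beta posterior per content item: [successes + 1, failures + 1]
    stats = {"quote_01": [1, 1], "quote_02": [1, 1], "quote_03": [1, 1]}

    def pick_content():
        # sample a plausible success rate per item and show the best draw;
        # weak content naturally stops being shown over time
        return max(stats, key=lambda item: random.betavariate(*stats[item]))

    def record_feedback(item, liked):
        stats[item][0 if liked else 1] += 1  # update successes / failures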

There is an interesting study featured in "Thinking Fast and Slow" where they
had two groups in a pottery class. The first group would have the entirety of
their grade based on the creativity of a single piece they submit. The second
group was graded on only the total number of pounds of clay they threw.

The second group crushed the first group in terms of creativity.

~~~
kens
I tried to find the original source of the quality-vs-quantity pottery class
story a while back. I think it originates in the book "Art and Fear" but in
that book it reads like a parable rather than a factual event. I'm highly
suspicious of whether this event actually happened. Anyone have solid
evidence?

~~~
orasis
It was featured in "Thinking Fast and Slow", and those authors seem quite
academically rigorous.

~~~
tripzilch
Yes they're very good at "seeming" the part. Until you check the
cites/references.

Please excuse me for lumping Kahneman, Tversky and Taleb into the same bin
(stating beforehand, feel free to dismiss or form your own opinions because of
this). My justification is that they cite each other all the time, write about the
same topics and are quoted as doing their research by "bouncing ideas off each
other" (only to later dig up surveys or concoct experiments to confirm these--
psych & social sciences should take some cues from psi research and do pre-
registration).

This is now the second time I've noticed that one of their anecdotes, posed as
"research", doesn't _quite_ check out to be as juicy and clear-cut (or even
describe the same thing) as the citation given.

The other one was the first random cite I checked in Taleb's _Black Swan_: a
statistics puzzle about a big and a smaller hospital and the number of baby boys
born in them on a specific day. The claim was that research (from a metastudy by
Kahneman & Tversky) showed a large percentage of professional statisticians
getting the answer to this puzzle wrong, which is quite hard to believe because
the puzzle isn't very tricky at all. Checking the cited publication (and the
actual survey it drew on), it turns out it was a _much_ harder statistical
question, and it wasn't professional statisticians but psychologists at a
conference getting the answer wrong.

Good thing I already finished the book before checking the cite. I was so put
off by this finding, I didn't bother to check anything else (most papers I
read are on computational science, where this kind of fudging of details is
neither worth it, nor easy to get away with).

Which is too bad because the topics they write about _are_ very interesting,
and worthwhile/important areas of research. I still believe it's not unlikely
that a lot of their ideas _do_ in fact hold kernels of truth, but I'm fairly
sure that a number of them really do not. And their field of research
would be stronger for figuring out which is which. Unfortunately this is
not always the case for the juicy anecdotes that sell pop-sci / pop-psych
books.

~~~
ves
You should read Silent Risk (currently freely available as a draft on Taleb's
website). It's mathematical, with solid proofs and derivations, so it doesn't
really suffer from the same problems as The Black Swan. I had basically the
same issues with the Black Swan that you did.

~~~
tripzilch
Thanks for the tip, I'll check it out! On that note, I also read _Fooled by
Randomness_, an earlier work by Nassim Nicholas Taleb. It covers (somewhat)
similar topics to _Black Swan_, but from a bit more technical perspective. I
found it a much more pleasant and educational read. IIRC, it talks less about
the financial world and rare catastrophic risks ("Black Swans"), and more
about how humans are wired to reason intuitively quite well about _certain_
types of probability/estimates/combinatorics, and particularly on how we
really _suck_ at certain others.

 _Fooled by Randomness_ also doesn't needlessly bash people for wearing ties
or being French, like _Black Swan_ does :) ... only later did I learn that
Taleb was a personal friend of Benoit Mandelbrot (who was French), which put
it in a bit more context (as a friendly poke, I'm assuming), but at the time I
found it a bit off-putting and weird, as it really had nothing to do with the
subject matter at all.

------
spiderfarmer
I did a lot of A/B testing, but I think the examples that are used in a lot of
articles about A/B testing are weird.

For small changes like changing the color / appearance of a button, the
difference in conversion rate is not measurable. Maybe if you can test with
traffic in the range of >100K unique visitors (from the same sources), you can
say with confidence which button performed better.

But how many websites / apps really have >100K uniques? If you have a long
running test, just to gather enough traffic, chances are some other factors
have changed as well, like the weather, weekdays / weekends, time of month,
etc.

And if you have <100K uniques, does the increase in conversion pay for the
amount of time you have invested in setting up the test?

In my experience, it's only when you test completely different pages that
you'll see significant differences.

~~~
butler14
Seasonality doesn't really come into it with A/B split testing - if it changes
for one group, it changes for both.

~~~
spiderfarmer
Let's assume your site is now working with option A. Conversion rate is 5.7%,
measured over a month.

You're now running an A/B test for 1 week with option A and another version, option
B. You get a conversion rate of 6.5% for option A and 5.9% for option B.
Normally, you'd say that 6.5% is better than 5.9%. But how sure can you be if
you don't control the other factors and both are performing better than
before? How many visitors do you need to offset the influence of offsite
factors?

~~~
cbr
Set your site up to be able to handle option A and option B at the same time,
randomly assigning visitors to each. Now you don't need to control for
anything, because you can just compare the A-visitors to the B-visitors.

In your case, option A started doing a lot better when you started running the
test, which is moderately surprising, so maybe you messed something up in user
assignment, tracking, or something. But if your framework is good, then it
sounds like an external change. You still want to look at how A's rate and B's
rate compare for the time period when you were randomizing, so 6.5% vs 5.9%,
with the earlier 5.7% being more or less irrelevant.

You do need to do significance testing to see how likely you would be to get
this wide a difference by chance. The easiest way to do that is to enter your
#samples and #conversions into a significance calculator [1], and the p-value
tells you how likely this is (assuming you didn't have any prior reason to
expect one side of the experiment to be any better).

[1] Like this one [https://vwo.com/ab-split-test-significance-
calculator/](https://vwo.com/ab-split-test-significance-calculator/) , but
multiply the p-values it gives you by 2 to account for them incorrectly using
a one-tailed test.
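
If you'd rather compute it yourself than use the calculator, here's a minimal
sketch of that two-tailed check (the visitor/conversion counts below are
hypothetical):

    from math import sqrt, erf

    def two_tailed_p(conv_a, n_a, conv_b, n_b):
        # two-proportion z-test: pooled rate, standard error, z, two-tailed p
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_a - p_b) / se
        normal_cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
        return 2 * (1 - normal_cdf(abs(z)))

    # e.g. 6.5% of 2000 A-visitors vs 5.9% of 2000 B-visitors
    print(two_tailed_p(130, 2000, 118, 2000))  # ~0.43, i.e. not significant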

------
sixtypoundhound
Novices also tend to gravitate towards "end-game" business metrics which have
a lot more inherent variation than simple operational indicators.

For example - optimizing a content site for AdSense; many folks would
gravitate to AdSense $$ as the target metric, which is admittedly an intuitive
solution (since that's how you're ultimately getting paid).

But if you think about it....

AdSense Revenue =>

(1 - Bounce Rate) x Pages / Visit x % ads clicked x CPC

Bounce rate is a binomial probability with a relatively high success rate p (15%+),
thus you can get statistically solid reads on results with a relatively small
sample.

Pages / Visit is basically the aggregate of a Markov chain (1 - exit
probability); also relatively stable.

% ads clicked - a binomial probability with a low p; large samples become
important.

$ CPC - so the ugly thing here is there's a huge range in the value of a
click... often as low as $.05 for a casual mobile phone click or $30 for a
well qualified financial or legal click (think retargeting, with multiple
bidders). And you're usually dealing with a small sample of clicks (since the
average % CTR is very low). So HUGE natural variation in results. Oh, and
Google likes to penalty price sites with a large rapid increase in click-
through-rate (for a few days), so your short term CPC may not resemble what
you would earn in steady-state.

So while it may make ECONOMIC sense to use $ RPM as the test metric, you've
injected tremendous variation into the test. You can accurately read bounce
rate, page activity, and % click-through on a much smaller sample and feel
comfortable making a move if you're confident nothing major has changed in
terms of the ad quality (and CPC value) you will get.
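
Back-of-the-envelope sketch of why the upstream metrics settle faster (the base
rates and the 10% relative-precision target below are just illustrative
assumptions):

    def n_for_relative_precision(p, rel_err=0.10, z=1.96):
        # visitors needed so the 95% CI half-width is about rel_err * p
        return (z ** 2) * p * (1 - p) / (rel_err * p) ** 2

    print(round(n_for_relative_precision(0.15)))   # bounce rate ~15% -> ~2,200 visitors
    print(round(n_for_relative_precision(0.005)))  # ad CTR ~0.5%    -> ~76,000 visitors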

~~~
trjordan
Isn't that a good argument for using $$ as the metric to optimize for? If
you're going to get wiped out by variations in behavior because highly-
retargeted legal clicks are worth 500x more than mobile clicks, isn't that an
important variable?

The problem I frequently wonder about is that you have to assume independence
of the stable variables to be comfortable testing them. In reality, the
bounce rate of the people who make you lots of money is probably driven by
different factors than the bounce rate of the overall population.

I guess what you should really do is optimize the bounce rate / pages per
visit / etc. for just the population of people that could make you money, but
you don't typically have access to that information.

~~~
orasis
$$ can be done in a bandit setting, but the key challenge is that your
feedback is highly delayed (maybe weeks or months).

As the parent poster says, it's best to focus on individual funnel steps that
provide fast feedback, at least initially.

Once the whole funnel is optimized (does this ever happen?), you could start
feeding in end-to-end $$ metrics.

------
ReadingInBed
I thought this was a pretty good follow up to show the strengths and
weaknesses of this approach: [https://vwo.com/blog/multi-armed-bandit-
algorithm/](https://vwo.com/blog/multi-armed-bandit-algorithm/). Personally I
think this approach makes a lot more sense than A/B testing, especially since
people often hand off the methodology to a 3rd party without knowing exactly
how it works.

~~~
raverbashing
The points raised are valid; whether they matter is a different beast.

Even in the tests shown, the conversion rate was _higher_ for the MAB algorithms
than for simple A/B testing. "Oh, but you get higher statistical significance!"
Thanks, but that doesn't pay my bills; conversion pays.

~~~
tzs
Careful. It wasn't always higher for MAB even though the tables shown there
make it appear so at first.

Those tables are showing the conversion rate _during_ the test, up to the time
when statistical significance is achieved. You generally then stop the test
and go with the winning option for all your traffic.

In the two-way test where the two paths have real conversion rates of 10% and
20%, all of the MAB variations did win. Here is how many conversions there
would be after 10000 visitors for that test, and how they compare to the A/B
test:

    
    
      RAND   1988
      MAB-10 1997  +9
      MAB-24 2001 +13
      MAB-50 1996  +8
      MAB-90 1994  +6
    

For the three-way test where the three paths have real rates of 10%, 15%, and
20%, here is how many conversions there would be after 10000 visitors:

    
    
      RAND   1987
      MAB-10 1969 -18
      MAB-50 1987  +0
      MAB-77 1988  +1
    

Note that MAB-10 loses compared to RAND this time.

(The third column in the above two tables remains the same if you change 10000
to something else, as long as that something else is large enough for the test
to have finished. MAB-10 beats RAND in the first test by 9 conversions, and
loses by 18 conversions in the second test).

~~~
hythloday
> _up to the time when statistical significance is achieved. You generally
> then stop the test_

Just a note, don't literally do this:

[http://conversionxl.com/statistical-significance-does-not-
eq...](http://conversionxl.com/statistical-significance-does-not-equal-
validity/)

~~~
disgruntledphd2
Just to reiterate, this violates the assumptions under which you get your
p-values.

I want an A/B testing tool that won't let you see results until it's done.

------
3dfan

        10% of the time, we choose a lever at random. The
        other 90% of the time, we choose the lever that has
        the highest expectation of rewards. 
    

There is a problem with strategies that change the distribution over time:
Other factors change over time too.

For example let's say over time the percentage of your traffic that comes from
search engines increases. And this traffic converts better than your other
traffic. And let's say at the same time, your yellow button gets picked more
often than your green button.

This will make it look like the yellow button performs better than it actually
does, because it got more views during a time when the traffic was better.

This can drive your website in the wrong direction. If the yellow button
performs better at first just by chance then it will be displayed more and
more. If at the same time the quality of your traffic improves, that makes it
look like the yellow button is better, while in reality it might be worse.

In the end, the results of these kinds of adaptive strategies are almost
impossible to interpret.
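
(For reference, the rule quoted at the top of this comment is the classic
epsilon-greedy policy. A rough sketch, not the article's exact code, with
placeholder button names:)

    import random

    counts  = {"green": 1, "yellow": 1}   # impressions so far (start at 1 to avoid 0/0)
    rewards = {"green": 1, "yellow": 1}   # conversions so far (optimistic start)

    def choose(epsilon=0.10):
        if random.random() < epsilon:                  # 10%: explore a random lever
            return random.choice(list(counts))
        return max(counts, key=lambda arm: rewards[arm] / counts[arm])  # 90%: exploit

    def update(arm, converted):
        counts[arm] += 1
        rewards[arm] += 1 if converted else 0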

~~~
kevin_nisbet
I don't know if this is the case, if I understand this algorithm correctly. Say
yellow has a 50% success rate, and green is 65% after the behavior change but
30% before it.

By sending 90% of traffic towards yellow, its ratio will normalize towards
50% once it has enough traffic. By sending 10% of traffic randomly,
eventually the green option will reach 51% and start taking a majority of
traffic, which will then cause it to normalize at its 65% and be shown to a
majority of users.

I think the problem might be that if you run this with a sufficiently high
volume or for a long period of time, and a behaviour change takes place, it
will take a long time to learn the new behaviour. Or if two options aren't
actually different, it may continually flip back and forth between them.

Also, to me, A/B testing certain things may have an undesired consequence. For
example, I order from Amazon every day, but today the buy button is blue -
what does that actually mean? And I go back to the site later and it's yellow
again. There are still many people who get confused by seemingly innocuous
changes in the way their computer interacts with them.

~~~
TuringTest
> Also, to me, A/B testing certain things may have an undesired consequence.
> For example, I order from Amazon every day, but today the buy button is blue
> - what does that actually mean? And I go back to the site later and it's
> yellow again. There are still many people who get confused by seemingly
> innocuous changes in the way their computer interacts with them.

Proper A/B tests are supposed to be done on a per-unique-user basis. If you
access the shop from the same device or user account, a well-done A/B test
should consistently show you the same interface.
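
One common way to get that consistency is to deterministically hash a stable
visitor id (cookie value or account id) into a bucket. A sketch, with
illustrative names:

    import hashlib

    def assign_variant(visitor_id: str, experiment: str, variants=("A", "B")) -> str:
        # the same (experiment, visitor) pair always hashes to the same bucket
        digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
        return variants[int(digest, 16) % len(variants)]

    print(assign_variant("cookie-1234", "buy-button-color"))  # stable across visits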

~~~
kevin_nisbet
I agree. Sorry, in this context I was referring specifically to the "20 lines
of code" article, which I don't believe had this control in it.

------
tristanj
Earlier discussion (989 points, 1407 days ago)
[https://news.ycombinator.com/item?id=4040022](https://news.ycombinator.com/item?id=4040022)

~~~
thecatspaw
I really wish Hacker News would change the date format to days, months and
years rather than just days, or at least put the original submission date in
the title attribute of "1407 days ago".

Submission date is May 30, 2012.

------
ryporter
This is a good overview of the multi-arm bandit problem [1], but the author is
far too dismissive of A/B Testing.

First of all, the suggested approach isn't always practical. Imagine that you
are testing an overhaul of your website. Do you want daily individual visitors
to keep flipping back and forth as the probabilities change? I'm not sure if
the author is really suggesting his approach would be a better way to run drug
trials, but that's clearly ridiculous. You have to recruit a set of people to
participate in the study, and then you obviously can't change what drug you're
giving them during the course of the experiment!

Second, it ignores the time value of completing an experiment earlier. In the
exploration/exploitation tradeoff, sometimes short-term exploitation isn't
nearly as valuable as wrapping up an experiment so that your team can move to
new experiments (e.g., shutting down the old website in the previous example).
If a company expects to have a long lifetime, then, over a time frame
measured in weeks, exploration will likely be far more valuable.

[1] [https://en.wikipedia.org/wiki/Multi-
armed_bandit](https://en.wikipedia.org/wiki/Multi-armed_bandit)

~~~
davkap92
Regarding your first point, I'm not sure if the author covered it, but in
Google Content Experiments with the multi-armed bandit approach, cookies are
stored, so a user who sees variation B will keep seeing B while the experiment
is running.

~~~
mplewis
This is the approach every good A/B testing service/framework uses.

------
PaulHoule
It is really funny how communities don't talk.

For instance, A/B testing with a 50-50 split has been baked into "business
rules" frameworks for about as long as the multi-armed bandit has been around,
but nobody in that community has ever heard of the multi-armed bandit.
Meanwhile, machine learning people are celebrating the performance of NLP
systems they build that are far worse than the rule-based systems people were
using in industry and government 15 years ago.

~~~
wodenokoto
Which NLP system are far worse than which rule based systems?

The statement is odd for two reasons. One is that plenty of NLP is rule-based;
the other is that NLP isn't a form of A/B testing, which is the overall topic
here.

------
chias
I like the premise of this a lot, but it seems to me that the setting that the
author chose (some UI element of a website) is one of the worst possible
settings for this: what matters a whole lot more than whether your button is red or
green or blue is some modicum of _consistency_.

If you're constantly changing the button color, size, location, whatever...
that is an awful experience in and of itself, is it not? If the Amazon "buy
now" button changed size / shape / position / color every time I went to buy
something, I would get frustrated with it pretty quickly.

~~~
blakeyrat
One aspect of testing they leave unsaid is that you identify your users (cookie,
most commonly) to make sure each user always gets the same experience. That's
why your numbers are all based on _unique_ users, not merely users.

Their experience will still change once their cookie expires, but that amount
of time is completely under your control.

------
davkap92
Interesting, but since the writing of this article (2012) Google does offer
Multi-Armed Bandit Content Experiments...
[https://support.google.com/analytics/answer/2844870?hl=en](https://support.google.com/analytics/answer/2844870?hl=en)

------
hoddez
There is at least one caveat with multi-armed bandit testing. It assumes that
the site/app remains constant over the entire experiment. This is often not
the case, or not feasible, especially for websites with large teams deploying
constantly.

When your site is constantly changing in other ways, dynamically changing odds
can cause a skew because you could give more of A than B during a dependent
change, so you have to normalize for that somehow. A/B testing doesn't have
this issue because the odds are constant over time.

------
conductrics
When thinking about what type of approach is best, first think about the
nature of the problem. Is it a real optimization problem, i.e. are you more
concerned with learning an optimal controller for your marketing application?
If so, then ask:

1) Perishability: is the problem/information perishable? For example,
perishable: picking headlines for news articles; not perishable: a site
redesign. If perishable, then a bandit might give you real returns.

2) Complexity: are you using covariates (contextual bandits, reinforcement
learning with function approximation) or not? If you are, then you might want
your targeting model to serve up the best predicted options in subspaces
(frequent user types) where it has more experience, and to explore more in
less frequently visited areas (less common user types).

3) Scale/automation: you have tons of transactional decision problems, and it
just doesn't scale to have people running many A/B tests.

Often it is a mix - you might use a bandit approach with your predictive
targeting, but you should also A/B test the impact of your targeting model
approach vs. a current default and/or a random draw. See slides 59-65:
[http://www.slideshare.net/mgershoff/predictive-analytics-
bro...](http://www.slideshare.net/mgershoff/predictive-analytics-broken-down)

For a quick bandit overview check out:
[http://www.slideshare.net/mgershoff/conductrics-bandit-
basic...](http://www.slideshare.net/mgershoff/conductrics-bandit-
basicsemetrics1016)

------
thecopy
Off-topic, but why do I have to enable JavaScript to even see anything?

~~~
kqr
That is really weird. Technically the content is there all along (so it's not
loaded in by JavaScript) but you still have to have JavaScript enabled for it
to render. Who designed that!?

Edit: hahaha what. It appears the content is laid out with JavaScript. So
basically they're using JavaScript as a more dynamic CSS. Let that sink in.
They're using JavaScript as CSS.

It sorta-kinda makes sense for the fancy stream of comments but still... why
is it a requirement!?

~~~
pmontra
Very funny. They have a div #main with visibility:hidden. Removing that rule
from dev tools displays the full page without enabling JS. The comments block
is a solved problem with CSS. Basically any grid layout toolkit does that as
their very first demo.

~~~
dspillett
Yeah, if you are going to do that just for a fade-in transition, at least set
it to hidden with an inline script block, so that someone with scripts turned
off still gets the content. This blocks rendering until everything is
downloaded, but then so does your needless transition.

If you are worried about it breaking in some browsers because you are
modifying a tag whose content isn't complete yet, so it isn't accessible in
the DOM, or because you use a DOM manipulation library that hasn't loaded yet
due to lazy loading, have the script add an extra wrapper instead of modifying
the existing tag [i.e. <script>document.write('<div id="iamanidiot"
style="visibility:hidden">')</script> directly after <body> and
<script>document.write('</div>')</script> before </body>]. Or, of course, just
don't...

~~~
wtbob
> Or, of course, just don't...

I think that's really the takeaway here. Bytes and bytes of JavaScript, slower
rendering time, and for what? Just say no!

------
llull
Bandits are great, but using the theory correctly can be difficult (and if
accidentally applied incorrectly, one's results can easily become
pathologically bad). For instance, the standard stochastic setup requires that
learning instances are presented in an iid manner. This may not be true for
website visitors: for example, different behaviour at different times of day
(browsing vs. executing) or timezone-driven differences in cultural response. There
is never a simple, magic solution for these things.

------
saturdayplace
So I googled A/B testing vs Multi-Armed Bandit, and ran into an article that's
a useful and informative response to the OP: [https://vwo.com/blog/multi-
armed-bandit-algorithm/](https://vwo.com/blog/multi-armed-bandit-algorithm/)

edit: Ah, 'ReadingInBed beat me to it. tl;dr: Bandit approaches might not
_always_ be the best, and they tend to take longer to reach a statistically
significant result.

------
tmaly
I remember reading this post back in 2012. I ended up getting a copy of the
Bandit Algorithms book by John Myles White.

It's short, but it covers all the major variations.

------
creade
One of the best talks I've ever seen was Aaron Swartz talking about how
Victory Kit used multi-armed bandits to optimize political action:
[https://github.com/victorykit/whiplash](https://github.com/victorykit/whiplash)

------
jedberg
Just a warning, this isn't a magic bullet to replace all A/B testing. This is
great for code that has instant feedback and/or that the user will only see once,
but for things where the feedback loop is longer or the change is more obvious
or longer lasting (like a totally different UI experience), it doesn't work so
well.

For example, if your metric of success is that someone retains their monthly
membership to your site, it will take a month before you start getting any
data at all. At that point, in theory almost all of your users should already
be allocated to a test because hopefully they visited (and used) their monthly
subscription at least once. So it would be a really bad experience to suddenly
reallocate them to another test each month.

------
mabbo
This presumes a few things about the decision being tested, not all of which
always hold.

I ran a few basic A/B tests on some handscanner software used in large
warehouses. The basic premise is that the user is being directed where to go
and what items to collect. The customer wanted to know whether changes to the
font size and colour of certain text would improve overall user efficiency.
But the caveat was that we had to present a consistent experience to the user - you
can't change the font size every 10 seconds or it will definitely degrade user
experience!

My point is that it sounds as though the multi-armed bandit will probably work
great provided the test is short, simple, and the choice can be re-made often
without impacting users.

~~~
zeckalpha
It's possible to assign a user to a category once a week, and use the same
algorithm, rebalancing users between the categories as needed once a week.
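
A sketch of what that could look like, where choose() stands in for whatever
bandit selection rule is in use:

    import datetime

    assignments = {}  # user_id -> (iso week, arm)

    def weekly_choice(user_id, choose):
        week = datetime.date.today().isocalendar()[:2]   # (year, week number)
        if user_id not in assignments or assignments[user_id][0] != week:
            assignments[user_id] = (week, choose())      # rebalance at most weekly
        return assignments[user_id][1]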

------
scottlocklin
Reinforcement approaches are certainly interesting, but one of the things
missing here (and in most A/B stuff) is statistical significance and
experimental power. If you have enough data, there are hand-wavy arguments
that this will eventually be right, but in the meantime, if there is some
opportunity cost (say, imagine this is a trading algo trying to profit from
bid/ask), you've screwed yourself out of some unknown amount of profits. There
are actually ways of hanging a confidence interval on this approach which
virtually nobody outside the signal processing and information theory
communities knows about. Kind of a shame.

~~~
gburt
Any chance you could point me to a reference? I'm doing research in this space
and currently working on a paper which does exactly this for diagnostics of
testing processes.

~~~
scottlocklin
I'm not sure which thing I mentioned you need a reference on. For p-values on
machine learning techniques,
[http://vovk.net/cp/index.html](http://vovk.net/cp/index.html) I'll eventually
do a blog post on this subject; it's very good math that all ML people should
know about, though Vovk, Shafer and Gammerman write pretty dense articles.

For statistical power... "Statistical Power Analysis for the Behavioral
Sciences" by Jacob Cohen.

~~~
gburt
Sorry, I intended to quote your last sentence, applying a confidence interval
to a reinforcement learning system, especially with respect to multi-armed
bandits / adaptive experiments, but if I have to dig in to some signal
processing stuff I am happy to do that.

It seems the conformal prediction link has some relation. I will dig, thanks.

~~~
scottlocklin
Yes, conformal prediction does exactly this.

------
shade23
A side note: I love the website UX/Design. Especially the flowing comments
layout.

------
paulsutter
The author is confusing the map and the territory.

A/B testing is comparing the performance of alternatives. Epsilon-greedy is a
(good) way to do that. It's better than the most common approach, but you're
still testing A against B.

------
dalacv
A post on why the bandit is not better: [https://vwo.com/blog/multi-armed-
bandit-algorithm/](https://vwo.com/blog/multi-armed-bandit-algorithm/)

------
cle
This is only appropriate in certain situations. There are many business
situations in which it's more appropriate to run a traditional A/B test and
carefully examine and understand the results before making a business
decision. Always blindly accepting the results of a bandit is going to explode
in your face at some point.

There is no silver bullet, no free lunch. There is no algorithm that will beat
understanding your domain and carefully analyzing your data.

------
Zyst
Thought this was interesting, so I made it in C#. I could've optimised a bit
more in the Choose function but regardless, this is super cool!

[https://gist.github.com/Zyst/0da505007b0e8c21418247000f3e7d4...](https://gist.github.com/Zyst/0da505007b0e8c21418247000f3e7d40)

------
zardeh
What I love about this example is that it's the same algorithm applied in two
completely different areas. This algorithm (or variants of it) can be used in
place of A/B testing. The same algorithm can be applied to game playing, and
you get Monte Carlo tree search, the basis of AlphaGo.

------
tarsinge
A/B Tasty (the most used platform in France I think, not affiliated) uses this
approach:

[http://blog.abtasty.com/en/clever-stats-finally-
statistics-s...](http://blog.abtasty.com/en/clever-stats-finally-statistics-
suited-to-your-needs/)

------
aub3bhat
"Like many techniques in machine learning, the simplest strategy is hard to
beat." is a thoroughly ridiculous statement.

It should instead say "Like many techniques in machine learning, the simplest
strategy is easiest to implement", as the title of the post (20 lines) makes
clear.

~~~
argonaut
That's not _thoroughly_ ridiculous.

For many problems in machine learning, k-nearest-neighbors and a large dataset
is very hard to beat in terms of error rate. Of course, the time to run a
query is beyond atrocious, so other models are favored even if kNN has a lower
error rate.

~~~
mortehu
According to [1], k-NN is pretty far from being the top general classifier.
Admittedly, all the data sets are small, with no more than 130,000 instances.
When does it start becoming "very hard to beat", and what are you basing that
on?

1\.
[http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf](http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf)

~~~
argonaut
The vast majority of the data sets are very small (on the order of 100 to 1000
examples!). In fact, in the paper they discarded some of the larger UCI
datasets.

It's not surprising that they found random forests perform better, even though
the conventional wisdom is that boosting outperforms (it's much harder to overfit
with forests).

------
orasis
I've been using bandits in my mobile apps for the last 3 years and am now
packaging them as a SaaS REST API. Message me if you want an early look.

~~~
katafalkas
Hey, I was working on the same thing. Wanna talk about it?

------
imaginenore
Multivariate A/B testing lets you test combinations of multiple changes
simultaneously.

------
gaz
There is a startup (unlaunched) built around this approach:
[http://www.growthgiant.com](http://www.growthgiant.com)

------
mkj
Why 10%? Sounds like a magic constant.

~~~
martin_bech
I believe that's the greed factor: 90% of the time you are using the winning
version of the page/button or whatnot.

------
wnevets
Isn't this how A/B testing works with Google Analytics?

------
sideproject
I digress, but this site has an interesting comments layout.

------
pitchka
John Langford and the team (Microsoft Research) have built a contextual bandit
library in Vowpal Wabbit.

It can be used for everything from active learning to changing webpage layouts
to increase ad clicks. It has the best bounds out of all exploration algorithms.

The structured contextual bandits that come with LOLS (another algorithm
present in Vowpal Wabbit) are extremely powerful.

All for free under BSD3.

