20 lines of code that beat A/B testing (2012) (stevehanov.ca)
545 points by _kush on Apr 6, 2016 | 157 comments



Here's what everyone is missing. Don't use bandits to A/B test UI elements, use them to optimize your content / mobile game levels.

My app, 7 Second Meditation, holds a solid 5 stars across 100+ reviews because I use bandits to optimize my content.

By having the system automatically separate the wheat from the chaff, I am free to just spew out content regardless of its quality. This allows me to let go of perfectionism and just create.

There is an interesting study featured in "Thinking Fast and Slow" where they had two groups in a pottery class. The first group would have the entirety of their grade based on the creativity of a single piece they submit. The second group was graded on only the total number of pounds of clay they threw.

The second group crushed the first group in terms of creativity.


Love that example, even if frequently retold.

http://blog.codinghorror.com/quantity-always-trumps-quality/


I tried to find the original source of the quality-vs-quantity pottery class story a while back. I think it originates in the book "Art and Fear" but in that book it reads like a parable rather than a factual event. I'm highly suspicious of whether this event actually happened. Anyone have solid evidence?


Having researched this myself moderately extensively, I do not believe that the event actually happened.

Moreover, the book "Thinking fast and slow" does not contain the word "pottery" (nor "ceramics").


yeah this is referenced all over the place (Derek Sivers, Jeff Atwood, Kevin Kelly) but it's always just this one paragraph, and it comes from this book: http://kk.org/cooltools/art-fear/ and I don't see any references there either.


It was featured in "Thinking Fast and Slow", and those authors seem quite academically rigorous.


Yes they're very good at "seeming" the part. Until you check the cites/references.

Please excuse me for lumping Kahneman, Tversky and Taleb into the same bin (stating this beforehand; feel free to dismiss or form your own opinions because of it). My justification is that they cite each other all the time, write about the same topics and are quoted as doing their research by "bouncing ideas off each other" (only to later dig up surveys or concoct experiments to confirm these--psych & social sciences should take some cues from psi research and do pre-registration).

This is now the second time I notice that one of their anecdotes posed as "research" doesn't quite check out to be as juicy and clear-cut (or even describing the same thing) as the citation given.

The other one was the first random cite I checked in Taleb's Black Swan: a statistics puzzle about a big and a smaller hospital and the number of baby boys born in them on a specific day. The claim was that research (from a metastudy by Kahneman & Tversky) showed a large percentage of professional statisticians getting the answer to this puzzle wrong. Which is quite hard to believe, because the puzzle isn't very tricky at all. Checking the cited publication (and the actual survey it meta-analyzed), it turns out it was a much harder statistical question, and it wasn't professional statisticians but psychologists at a conference getting the answer wrong.

Good thing I already finished the book before checking the cite. I was so put off by this finding, I didn't bother to check anything else (most papers I read are on computational science, where this kind of fudging of details is neither worth it, nor easy to get away with).

Which is too bad because the topics they write about are very interesting, and worthwhile/important areas of research. I still believe it's not unlikely that a lot of their ideas do in fact hold kernels of truth, but I'm fairly sure that also a number of them really do not. And their field of research would be stronger for it to figure out which is which. Unfortunately this is not always the case for the juicy anecdotes that sell pop-sci / pop-psych books.


You should read Silent Risk (currently freely available as a draft on Taleb's website). It's mathematical, with solid proofs and derivations, so it doesn't really suffer from the same problems as The Black Swan. I had basically the same issues with The Black Swan that you did.


Thanks for the tip, I'll check it out! On that note, I also read Fooled by Randomness, an earlier work by Nicholas Taleb. It covers (somewhat) similar topics as Black Swan does, but from a bit more technical perspective. I found it a much more pleasant and educational read. IIRC, it talks less about the financial world and rare catastrophic risks ("Black Swans"), and more about how humans are wired to reason intuitively quite well about certain types of probability/estimates/combinatorics, and particularly on how we really suck at certain others.

Fooled by Randomness also doesn't needlessly bash people for wearing ties or being French, like Black Swan does :) ... only later did I learn that Taleb was a personal friend of Benoit Mandelbrot (who was French), which put it in a bit more context (as a friendly poke, I'm assuming), but at the time I found it a bit off-putting and weird, as it really had nothing to do with the subject matter at all.


This is very interesting, thank you. Modern non-fiction is so frustrating.


That's an appeal to authority; it sounds like @kens actually tried to verify this story. I'd be curious to hear more examples as well. I believe I heard something similar on the You're Not So Smart podcast, but that might have referenced the same example.


I do think the example is bullshit, simply because you could throw ONE very thick pot and have it weigh more than 50 pounds. Of course it's a horrible pot, and you wouldn't get any better since you only made one piece, but it would get you an A since it would use more clay. Total # of pieces thrown would make more sense than pounds of clay used.


(Not that I think the anecdote is real, but) that's the idea -- the parable was about how the pot with the best quality was found in the 'quantity' team even though this wasn't their objective.


Forgive me if I've missed something but the original comment says the group that threw the most clay scored higher on creativity, not scored higher on pounds of clay thrown.


Guess you missed something. "The second group was graded on only the total number of pounds of clay they threw."


I don't understand what you mean by "use them to optimize your content" - how are you doing that with your app? Are you serving different messages to different groups of people? How are you grouping/testing/rating them?


My bandit system generates an ordered list of the content for each individual user. I then track whether the user came back the next day or churned. Yes, they may churn due to many other factors, but the signal of the content itself is strong enough.

In the past I have used the share rate to optimize, but I've realized that retention is more important.
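A minimal sketch of what such a content bandit could look like (illustrative only, not the app's actual code): Thompson sampling with a Beta posterior per content item, where the reward signal is "user came back the next day".

```python
import random

class ContentBandit:
    """Thompson-sampling bandit over pieces of content.

    Each item keeps a Beta posterior over its retention rate,
    i.e. 'did the user come back the next day'.
    """

    def __init__(self, content_ids):
        # Beta(1, 1) prior: one pseudo-retain and one pseudo-churn per item
        self.stats = {cid: {"retained": 1, "churned": 1} for cid in content_ids}

    def ordered_content(self):
        # Draw a plausible retention rate for each item from its posterior,
        # then order the content best-draw-first for this user
        draws = {
            cid: random.betavariate(s["retained"], s["churned"])
            for cid, s in self.stats.items()
        }
        return sorted(draws, key=draws.get, reverse=True)

    def record(self, cid, came_back):
        # Update the posterior once we know whether the user returned
        self.stats[cid]["retained" if came_back else "churned"] += 1
```

Weak content loses most of its posterior draws almost immediately and sinks to the bottom of every user's list, which is the "separate the wheat from the chaff" effect described above.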


I don't know the app, but I assume he has a core action (-> which he iterates/churns out) and evaluates whether people act (-> used as rating)

eg (guesses here)

new notification types "think for 9 seconds about a loved one" "what was your favorite vacation, go there"

if people act -> reward and reuse if not -> reuse less (only by random chance)

an alternative to the notification could be a "completed meditation" screen


> The first group would have the entirety of their grade based on the creativity of a single piece they submit. The second group was graded on only the total number of pounds of clay they threw.

I feel like that works partly because an important part of practice is the feedback loop between continually practicing and having a sense of whether you did well or not.

Your strategy of not evaluating your own work sounds a bit like mushing clay into shapes with a blindfold on and then tossing it in the kiln before you even check whether or not it's shaped like a pot. The users can sort through them later!

If the end goal is just ending up with a volume of work that's been culled down to the better ones, I guess you still get that. But it's inherently different from the Thinking Fast and Slow example where they're in a class and the goal is to learn and get better, rather than see who's made the nicest pot by the end of the semester.


I get the same effect using bandits. I practice my craft by spewing out volume rather than focusing on quality. The penalty I pay for the bad content is negligible because the bayesian bandits cull them very quickly. I am learning and getting better, but it is because I am not paralyzed by perfectionism.


You might want to be careful with your conclusions.

We don't know from this study whether the second group was more or less likely to get trapped in the Expert Beginner phase of development.

You definitely don't get anywhere without practice, but you are likely to get nowhere fast without theory.


insightful analogy.

so then "Expert Beginner phase" = a local minimum, which might be reached quickly compared to a more systematic search (practice informed/infused with theory). Once in that local state, the "expert beginner" might be unable to escape because of over-reliance on small, poorly guided step sizes; the real prize anyway is the global minimum.


solid 5 stars, 100+ reviews because I use bandits to optimize my content.

It's a great app, but were the ratings lower in the beginning, before the optimization? How do you know the optimization helped the ratings? I'm asking because it seems the app could have good ratings regardless of the content optimization, because it's a "feel good" app. Are there counterexamples of other meditation apps where the UI is good but the reviews are bad because of low quality content?


There were two epiphanies hidden in this comment. HN is good for one or two every once in a while, but for me personally, this struck gold. Thanks!!


Good point! What do you use to do in-app A/B testing?


I implemented my own that I'll be making available for others soon at improve.ai (the site isn't up yet)


cool! good luck. let me know what it is when it launches. I'm in the market for a good iOS sdk. @tylerswartz


>I am free to just spew out content regardless of its quality.

Oh, that's a great goal to have...


By focusing on volume, I get quality as a side effect. I'm very proud of the app I've created and the feedback I get from my users:

"This app gives a text reminder to do what everyone wants to do: relax, love one's self and others, and bring peace and light into the world. I smile when the alert message appears and feel grateful to the makers of this app for creating a pleasant, quick way to meditate at the most stressful point of my day. I have recommended it to many people"

https://itunes.apple.com/us/app/7-second-meditation-daily/id...


Volume and efficiency are great, but nobody likes a shit cannon.


I actually think that's a great idea, because then the users and algorithm decide what's good for you, and it will hide the low quality content automatically.

You have to do that sort of "curation" anyways anytime you make something. You have to continually decide whether it's worth it to keep working on something, and then decide if it's good enough to release. People tend to be pretty bad judges of this, especially for more creative tasks. (There are many examples of someone's most famous song/book/painting not being their own favorite, or even being their least favorite!)

So why not relieve some of that stress, and let the users pick what they really like, instead of you guessing for them?


"I promise, there is a diamond somewhere in this bathtub of shit".

Efficiency is great, but there still needs to be some quality standards.


Most probably the input is not random words; instead, orasis writes pieces aiming for quality. He just doesn't bleed over each and every word, and accepts a maximum time per text piece, accepting that some variation in quality is inescapable and that it is most efficient to allocate less time per piece so he gets more pieces. Note that this "less time" is relative, and probably does not go to zero. It's just that orasis has found a level of effort per piece that is effective, and he relies on the algorithm so he does not need to establish a threshold below which he will not publish, as those pieces will be swallowed by the rest of the higher quality corpus.

He might have leaned toward a little bit of hyperbole in "just spew out content regardless of its quality" for the purposes of clarity and expressiveness, which most of the readers have correctly parsed, but I don't think you have the right to tag orasis' output as shit based solely on the information of the page. That's a little bit rude, in fact.


I did a lot of A/B testing, but I think the examples that are used in a lot of articles about A/B testing are weird.

For small changes, like changing the color / appearance of a button, the difference in conversion rate is not measurable. Maybe if you can test with traffic in the range of >100K unique visitors (from the same sources), you can say with confidence which button performed better.

But how many websites / apps really have >100K uniques? If you have a long running test, just to gather enough traffic, chances are some other factors have changed as well, like the weather, weekdays / weekends, time of month, etc.

And if you have <100K uniques, does the increase in conversion pay for the amount of time you have invested in setting up the test?

In my experience, you'll only see significant differences when you test completely different pages.


You are right, and the guy that responded with a test of just under 150 conversions is a great example of exactly what people get wrong.

I have been doing design optimization for nearly a decade now and virtually every example I've seen made public is incredibly poor. Sometimes they do not even include any numbers, but just say, look first I had a 1% conversion rate and now I got 1.2%! I wish I had a good example of optimization case study, but I don't recall the last time I saw one.

Without giving anything away, I have done tests this year which required over a million impressions to find out in a statistically significant way not to make any changes.

My general theory is that a never optimized design with a confusing UI has a lot of low hanging fruit. You start cleaning up the bad elements and conversion rates can double or triple. Even if the sample size is less than ideal, these really big pops will be apparent.

Design knowledge is better now than it was in 2003. Mobile forces designers to use one column and leave out a lot of crap. There are a lot of good examples that get copied and good out of the box software. That means when you start optimizing, the low hanging fruit is gone and you need really big sample sizes. Once the low hanging fruit is gone, often those big samples just tell you not to make any changes.

I've been thinking about free-to-play mobile games recently (an area I have no experience in). If 1% or fewer of users are converting you really do need a huge install base to beat your competitors at optimization. You need millions of users just to get to 20,000 or 30,000 paying players to test behavioral changes on. That means there actually is some staying power for the winners, at least for a while.


My professional advice to people who make these unmeasurable changes is: then don't make them. The worst part about unmeasurable changes is that you can't verify whether they have a negative effect either, and you're essentially saying "I'm focusing on making this change when it will have negligible impact on the business."

You have better things to do.


I agree. Every article about A/B testing talks about making small changes and measuring them, while you should be measuring only big changes or nothing at all.


People read about Google testing different shades of blue on their home page and want to do the same on their website.

They tend to forget that Google has a billion unique users per day, and their website has a few hundred.


> the difference in conversion rate is not measurable

Wut?

Here are the results of me changing an "add to cart" button from a branded-looking maroon button to a simple yellow (Amazon-style) button: http://cl.ly/0d440I3T333m

That's a 26% sales increase from changing a button's color.

If you've got a good eye for usability, your intuition is going to lead to a lot of fun and great results with A/B testing. If not, you'll futz with things that make no difference most of the time and eventually give up. Experimentation is not about testing random permutations (unless you've got an infinite amount of time). It's about coming up with a reasonable hypothesis and then testing it. I study a web page until I come to a conclusion in my mind that something could really use some improvement. Then, even though I'm sure of it in my mind, I test it, because "sure of it in my mind" is wrong about half the time.

One note to those who use Optimizely: you really need to wait at least twice as long as Optimizely thinks, because statistical significance can bounce around with too little traffic. Optimizely thought the above test was done at about 2000 views, which was far too few results to be conclusive, with only ~20 sales.


You really should question the statistical significance of those numbers.


Running their numbers:

Assuming the usual 1+sqrt(n) error on a count, as n is low. Combining the uncertainties from 83/66 [0] gives 1.258 +- 0.231 (1sd). It's only just likely that it is an actual improvement; there is about a 30% chance that the original was better.

[0] (((1+66^0.5)/66)^2+((1+83^0.5)/83)^2)^0.5 * 83.0/66

EDIT

If we also include the errors on the total counts and use the fraction 83.0/66/6362*6392, then the error is:

(((1+66^0.5)/66)^2+((1+83^0.5)/83)^2+((1+6362^0.5)/6362)^2+((1+6392^0.5)/6392)^2)^0.5 * 83.0/66/6362*6392

which shows an improvement of 1.264 +- 0.234. Statistically, naught.
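The same arithmetic, redone in Python under the same 1+sqrt(n) counting-error assumption:

```python
from math import sqrt

def rel_err(n):
    # Assumed relative counting error on a count of n: (1 + sqrt(n)) / n
    return (1 + sqrt(n)) / n

# Conversion counts for the original (66) and yellow (83) buttons
ratio = 83.0 / 66.0
err_counts = sqrt(rel_err(66) ** 2 + rel_err(83) ** 2) * ratio
print(round(ratio, 3), round(err_counts, 3))  # -> 1.258 0.232

# Also including the error on the totals (6392 and 6362 views)
ratio_full = 83.0 / 66.0 / 6362.0 * 6392.0
err_full = sqrt(
    rel_err(66) ** 2 + rel_err(83) ** 2
    + rel_err(6362) ** 2 + rel_err(6392) ** 2
) * ratio_full
print(round(ratio_full, 3), round(err_full, 3))  # -> 1.264 0.234
```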


I'll add my own Bayesian analysis to the fray. Assuming a binomial, in Julia:

    using Distributions

    b_old = Beta(66+1, 6392-66+1)
    b_yel = Beta(83+1, 6362-83+1)

    N = 1000000
    # Sample from both distributions, count the fraction of samples that are better
    sum(rand(b_old, N) .> rand(b_yel, N)) / N
This yields a 7.7% chance that the old one is better than "yellow". It's fascinating to see how we can get such different answers to a simple question.


Thanks for this analysis.

I did an A/B test on an older framework which didn't automate statistical significance at all, but the website was getting more than 2000-3000 orders per day, so after a single week we had enough data to determine that sales had increased by 36% (we reduced the page's load time by almost half, changed the checkout to use Ajax, and made a few other small changes) without the need to quantify things. In fact, at the time, I didn't even know what "statistical significance" was... not that I know too much more about statistics now than I did then.

Anyway, in theory all of the exact p values, etc. matter, but in practice, the bottom line is all that matters, because the p values can change in a moment based on something I haven't factored in. That's where, at least for the time being, intuition still plays a great part in being actually correct, which is why we still have people with repeated successes.


Good point, there are always unmodeled factors and prior information, so the % is to be interpreted in context.


Why is that answer so different? It's saying there's a 90%+ chance that the new one is better. That's also his answer.


I get about the same result using my favorite online Bayesian split test calculator: http://www.peakconversion.com/2012/02/ab-split-test-graphica...


Specifically, I get p=0.15 for those numbers. And that's assuming that they're based on running the test for a fixed amount of time instead of stopping once they look good.

So: probably better, but should have run it longer.
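For the record, one way to reproduce that p=0.15 from the thread's counts is a pooled two-proportion z-test (a sketch, assuming the screenshot's 66/6392 vs 83/6362 conversions):

```python
from math import sqrt, erf

# Conversions / visitors for original (a) and yellow (b)
c_a, n_a = 66, 6392
c_b, n_b = 83, 6362

p_a, p_b = c_a / n_a, c_b / n_b
p_pool = (c_a + c_b) / (n_a + n_b)  # pooled conversion rate under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se

# Two-tailed p-value via the standard normal CDF
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print(round(z, 2), round(p_value, 2))  # -> 1.43 0.15
```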


The lady who owned the site ended up changing ecommerce platforms before the test could complete, due to issues with the software. Sadly, her "add to cart" buttons on her new site are again styled to her brand...

I wanted Optimizely to say it was 100% significant for a full week of it running before I ended the test, but the chart was interesting to me, because the conversion rate difference between the two remained the same for the entire test, rather than there being a specific period where "yellow" excelled.


> I wanted Optimizely to say it was 100% significant

That's not how statistical significance works...


You should contact Optimizely and let them know right now.

https://www.optimizely.com/contact/

They'll need to reeducate their statisticians right away!


The basic model that Optimizely uses is a Z-test approximation of a binomial distribution. To run a proper experiment with that model, you should calculate the sample size ahead of time, and then run it. Each visitor should be independent, and not affected by things like the day of the week or the time of day. The end result tells you if the distributions are different, but not as much as one would think about the size of the difference. It also can't be 100%: the normal distribution has an infinite range, so a finite limit can never capture 100% of it.
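The "calculate the sample size ahead of time" step can be sketched with the standard two-proportion power formula (the lift numbers here are illustrative, roughly matching the thread's example):

```python
from math import ceil

def sample_size_per_arm(p_base, p_target, z_alpha=1.96, z_power=0.8416):
    """Visitors needed per variation for a two-proportion z-test
    at alpha = 0.05 (two-tailed) with 80% power."""
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    effect = abs(p_target - p_base)
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

# Detecting a 1.0% -> 1.3% conversion lift: roughly 20k visitors per arm
print(sample_size_per_arm(0.010, 0.013))
# A subtler 1.0% -> 1.1% lift: well over 150k per arm
print(sample_size_per_arm(0.010, 0.011))
```

Which is why tests on low base rates with small lifts can quietly eat a million impressions just to say "no change".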

Optimizely is in a rough spot. People don't like having to think through experimental design, and they are really, really bad at reasoning about p-values. To try to fix the people part, they came out with the sequential stopping rule stuff (their "stats engine"), but they never really published much justifying it. The other alternative would be to move the experiments into a Bayesian framework, but that has a lot of its own problems. When they acquired Synference, that was one of the likely directions to take (along with offering bandits), but that didn't work out and those guys have since left.


[Disclaimer: I previously worked for Optimizely as predictive analytics PM - but no longer work there, and don't speak for the company.]

Optimizely has a bandit based 'traffic auto-allocation' feature in production on select enterprise plans [1]; bandits are excellent in a wide range of situations, and have many advantages, but like anything, have design parameters and there are some caveats you have to be aware of to make sure you are using them effectively.

On Frequentist and Bayesian: Optimizely's stats engine combines elements of both Frequentist and Bayesian statistics. They have a blog post that tries to touch on this issue [2]. But this is subtle stuff, and there are a lot of trade-offs and different perspectives; look at the Bayesian/frequentist debate, which has been going on for decades among statisticians.

But, FWIW, I definitely saw Optimizely as an organisation make a big investment to produce a stats engine which had the right trade-offs for how their customers were trying to test; and I think the end result was way more suitable than 'traditional' statistics were.

[1] https://help.optimizely.com/hc/en-us/articles/200040115-Traf... "Traffic Auto-allocation automatically adjusts your traffic allocation over time to maximize the number of conversions for your primary goal. [...] To learn more about how algorithms like this work, you might want to read about a popular statistics problem called the “multi-armed bandit.”"

[2] https://blog.optimizely.com/2015/03/04/bayesian-vs-frequenti... "Yet as we developed a statistical model that would more accurately match how Optimizely’s customers use their experiment results to make decisions (Stats Engine), it became clear that the best solution would need to blend elements of both Frequentist and Bayesian methods to deliver both the reliability of Frequentist statistics and the speed and agility of Bayesian ones."


Hi Fergal!

I didn't realize that the auto-allocation ever shipped, but I'm glad it finally did. Hopefully there was work done to empirically show that they solved a lot of the issues around time to convert and other messy parts of the data that killed earlier efforts, but I think everyone who knew about those was gone before you joined :)

There are very subtle issues with both frequentist and Bayesian stats, which makes combining them sound insane to me.

What are you up to these days?


> but they never really published much justifying it

Not that I'm trying to defend Optimizely (I'm not a huge fan, but for other reasons...).

I can't vouch for the quality either, but they did publish something about it[0] - that at least looks quite scientific. Happy to read any critique of course.

[0] http://pages.optimizely.com/rs/optimizely/images/stats_engin...


LaTeX is a wonderful way to make a marketing paper look like a scientific one. It doesn't accurately describe the method, but that isn't really its purpose. It's a more technical description of the blog post, meant for people using the product to understand some of the tradeoffs and get more accurate results.

They are still having people make very fundamentally flawed assumptions about the data, which results in incorrect conclusions, and they are still not presenting the results in a way that people correctly interpret them. That being said, those are really hard to solve, and models that would try to correct for them would likely require a lot more data and be overly conservative for more people.

What are your reasons for disliking Optimizely?


I was tempted to make a snarky comment about using LaTeX, but I'm not sure it's entirely fair. It doesn't seem like just a bunch of MarketingSpeak wrapped in LaTeX to be honest.

My issues with Optimizely are mainly about how they essentially ditched us as (paying) customers. We were admittedly small fish, but we were paying and willing to pay more, and then they switched to Enterprise-vs-Free without any middle ground. Enterprise was way too expensive for us; Free didn't include essential features, so we were just stuck.

I ended up writing an open-source javascript A/B test client[0] (and recently also an AWS-lambda backend[1]), but it still has a way to go...

[0] https://github.com/Alephbet/alephbet

[1] https://github.com/Alephbet/gimel


Hi, this is Leo, Optimizely's statistician. If you're looking for a more scientific paper, maybe take a look at this one we wrote recently: http://arxiv.org/abs/1512.04922

Should have everything you would ever want to know about the method.

I agree with you that the problem of inference and interpretation between A/B data, algorithms, and the people who make decisions from them is a hard one and worth working on.

That said, I do think the two sources of error our stats engine addresses - repeatedly checking results, and cherry-picking from many metrics and variations - did make progress in having folks correctly interpret A/B tests. This did result in more conservative results, but the benefit was that the variations that do become significant are more trustworthy. I think this was absolutely the right tradeoff to make for our customers, and trustworthiness is a pretty important aspiration for stats/ML/data science in general.

Of course I did write the thing, so I'm not very impartial.


It may be marketing but I was able to implement a sequential A/B test based on it. Admittedly, I did need to do some work beyond merely copy/pasting an algorithm, but all I really needed to do was read their paper and some citations. I do believe that this document does describe a viable frequentist test and my implementation of it worked pretty well.

Disclaimer: I do stats work at VWO, an Optimizely competitor.

(Also if you want to read our tech paper, here it is: https://cdn2.hubspot.net/hubfs/310840/VWO_SmartStats_technic... This describes our Bayesian approach, which we believe to be less likely to be wrongly interpreted by non-statisticians.)


Optimizely did an A/B test that showed that customers respond better to rounded-up numbers. ;-) the real world is messy.

Years ago my team's statistician did a competitive review of various AB test apps, and reported various ways in which the UIs make statistically invalid statements to the user.


You're wrong. What you are thinking about is Optimizely's "Chance to Beat Baseline" number. That's different from the statistical significance, which is a setting you can change on the Settings page.

Being smug and condescending really backfires when you don't know what you're talking about.


Bro... http://cl.ly/1b0B3Y3o1w09

> Being smug and condescending really backfires when you don't know what you're talking about.

How's that working out for you?


They probably do, but why would they want to do so, when doing it would make their results appear less conclusive?


There's a reason you'll only see their number say >99%


The fact that the summary page doesn't show a number for the significance is very worrying. As the whole reason for it is to make a binary choice, there could be a 2-line chunk of js that just says "B might be better (60% probability)".


" If you have a long running test, just to gather enough traffic, changes are some other factors have changed as well, like the weather, weekdays / weekends, time of month, etc."

I'd serve them at the same time, randomize which clients see which one.


Of course, that's what A/B testing is about. But this only works if you have lots of traffic. The smaller the amount of traffic, the bigger the influence of offsite factors. Also, the results become more difficult to explain: https://news.ycombinator.com/item?id=11438098


And it turns out you randomly assigned B to poorer users more often than not, and this made B look worse...


That's why you do it properly randomized. Statistics tell you how much of a risk of your scenario (and similar) you are running at most.


seasonality doesn't really come into it with A/B split testing - if it changes for one group it changes for both.


To ignore seasonality requires assuming it has roughly the same effect on both groups. If they move in different directions or by very different amounts, you can't actually ignore it.

Simple example: you're comparing your standard site's theme with a new pumpkin orange theme. In October, the pumpkin orange theme might go great, whereas in December, it might perform worse. There's clearly a seasonality interaction you need to account for.

What you're thinking of is more like testing e-commerce button colors right after Christmas. After Christmas, it's likely that all your groups perform worse, but equally so.


Let's assume your site is now working with option A. Conversion rate is 5.7%, measured over a month.

You're now running an A/B for 1 week with option A and another version, option B. You get a conversion rate of 6.5% for option A and 5.9% for option B. Normally, you'd say that 6.5% is better than 5.9%. But how sure can you be if you don't control the other factors and both are performing better than before? How many visitors do you need to offset the influence of offsite factors?


Set your site up to be able to handle option A and option B at the same time, randomly assigning visitors to each. Now you don't need to control for anything, because you can just compare the A-visitors to the B-visitors.

In your case, option A started doing a lot better when you started running the test, which is moderately surprising, so maybe you messed something up in user assignment, tracking, or something. But if your framework is good, then it sounds like an external change. You still want to look at how A's rate and B's rate compare for the time period when you were randomizing, so 6.5% vs 5.9%, with the earlier 5.7% being more or less irrelevant.

You do need to do significance testing to see how likely you would be to get this wide a difference by chance. The easiest way to do that is to enter your #samples and #conversions into a significance calculator [1], and the p-value tells you how likely this is (assuming you didn't have any prior reason to expect one side of the experiment to be any better).

[1] Like this one https://vwo.com/ab-split-test-significance-calculator/ , but multiply the p-values it gives you by 2 to account for them incorrectly using a one-tailed test.
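If you'd rather not trust a web calculator, the same two-tailed p-value is a few lines of Python. This is a sketch of the standard pooled two-proportion z-test (normal approximation, which is fine at these sample sizes); the 10,000-visitor figure in the note below is a made-up example, not from the thread:

```python
import math

def two_tailed_p(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value for a difference in conversion rates,
    using the pooled two-proportion z-test (normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-tailed p-value.
    return math.erfc(abs(z) / math.sqrt(2))
```

For example, 6.5% vs 5.9% at 10,000 visitors per arm comes out around p = 0.08, which would not clear the usual 0.05 bar.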


I think the goal with this new mechanism is having the script "automatically" promote your Christmas-themed creatives in December, and go back to your normal creatives in January.

It seems like you could "bucket" results by week, to get some kind of date factor considered. Of course it's also possible your Christmas creatives are simply garbage and will never get promoted, even when they are in-season. ;)


Novices also tend to gravitate towards "end-game" business metrics which have a lot more inherent variation than simple operational indicators.

For example - optimizing a content site for AdSense; many folks would gravitate to AdSense $$ as the target metric, which is admittedly an intuitive solution (since that's how you're ultimately getting paid).

But if you think about it....

AdSense Revenue =

(1 - Bounce Rate) x Pages/Visit x % Ads Clicked x CPC

Bounce rate is a binomial outcome with a relatively high success probability p (15%+), so you can get statistically solid reads on results with a relatively small sample.

Pages / Visit is basically the aggregate of a Markov chain (1 - exit probability); also relatively stable.

% ads clicked - a binomial outcome with a low success probability p; large samples become important.

$ CPC - so the ugly thing here is that there's a huge range in the value of a click: often as low as $0.05 for a casual mobile phone click, or as high as $30 for a well-qualified financial or legal click (think retargeting, with multiple bidders). And you're usually dealing with a small sample of clicks (since the average CTR is very low). So there's HUGE natural variation in results. Oh, and Google likes to penalty-price sites with a large, rapid increase in click-through rate (for a few days), so your short-term CPC may not resemble what you would earn in steady state.

So while it may make ECONOMIC sense to use $ RPM as the test metric, you've injected tremendous variation into the test. You can accurately read bounce rate, page activity, and % click-through on a much smaller sample, and feel comfortable making a move if you're confident nothing major has changed in terms of the ad quality (and CPC value) you will get.
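The sample-size arithmetic backs this up. Here's a rough per-arm estimate using the standard two-proportion approximation (the 10% relative lift, alpha = 0.05, and power = 0.8 defaults are my assumptions, not the poster's):

```python
import math

def per_arm_sample_size(p, rel_lift=0.10, z_alpha=1.96, z_beta=0.84):
    """Approximate visitors needed per arm to detect a rel_lift relative
    change in a baseline rate p (two-sided alpha=0.05, power=0.8)."""
    delta = p * rel_lift
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2)
```

A 15% bounce rate needs on the order of 9,000 visitors per arm to detect a 10% lift, while a 1% ad CTR needs over 150,000: the low-p metrics are where the sample pain lives.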


Isn't that a good argument for using $$ as the metric to optimize for? If you're going to get wiped out by variations in behavior because highly-retargeted legal clicks are worth 500x more than mobile clicks, isn't that an important variable?

The problem I frequently wonder about is that you have to assume independence among the stable variables to be comfortable testing them. In reality, the bounce rate of the people who make you lots of money is probably driven by different factors than the bounce rate of the overall population.

I guess what you should really do is optimize the bounce rate / pages per visit / etc. for just the population of people that could make you money, but you don't typically have access to that information.


$$ can be done in a bandit setting, but the key challenge is that your feedback is highly delayed (maybe weeks or months).

As the parent poster says, it's best to focus on individual funnel steps that provide fast feedback, at least initially.

Once the whole funnel is optimized (does this ever happen?), you could start feeding in end-to-end $$ metrics.


I thought this was a pretty good follow-up showing the strengths and weaknesses of this approach: https://vwo.com/blog/multi-armed-bandit-algorithm/. Personally I think this approach makes a lot more sense than A/B testing, especially since people often hand off the methodology to a 3rd party without knowing exactly how it works.


Here are 2 good articles that follow up on the arguments presented by VWO in that article.

From the first link below: "They do make a compelling case that A/B testing is superior to one particular not very good bandit algorithm, because that particular algorithm does not take into account statistical significance.

However, there are bandit algorithms that account for statistical significance."

* https://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs...

* https://www.chrisstucchio.com/blog/2015/dont_use_bandits.htm...


Chris is now VWO's director of data science. We recently overhauled our stats. Here's a quick summary for that: https://vwo.com/blog/smartstats-testing-for-truth/


Interesting. Lack of Bayesian testing is one of the reasons I've never considered using VWO - nice to see that's now your main methodology.


Changing your users' UI always carries some cost, though, so I'm not sure it's really the smart choice.


The points raised are valid; whether they matter is a different beast.

Even in the tests shown, conversion rate was higher for the MAB algorithms than for simple A/B testing. "Oh, but you get higher statistical significance!" Thanks, but that doesn't pay my bills; conversion pays.


Careful. It wasn't always higher for MAB even though the tables shown there make it appear so at first.

Those tables are showing the conversion rate during the test, up to the time when statistical significance is achieved. You generally then stop the test and go with the winning option for all your traffic.

In the two-way test where the two paths have real conversion rates of 10% and 20%, all of the MAB variations did win. Here is how many conversions there would be after 10000 visitors for that test, and how they compare to the A/B test:

  RAND   1988
  MAB-10 1997  +9
  MAB-24 2001 +13
  MAB-50 1996  +8
  MAB-90 1994  +6
For the three-way test where the three paths have real rates of 10%, 15%, and 20%, here is how many conversions there would be after 10000 visitors:

  RAND   1987
  MAB-10 1969 -18
  MAB-50 1987  +0
  MAB-77 1988  +1
Note that MAB-10 loses compared to RAND this time.

(The third column in the above two tables remains the same if you change 10000 to something larger, since once a test reaches significance both strategies send all remaining traffic to the winner. MAB-10 beats RAND in the first test by 9 conversions, and loses by 18 conversions in the second test.)


> up to the time when statistical significance is achieved. You generally then stop the test

Just a note, don't literally do this:

http://conversionxl.com/statistical-significance-does-not-eq...


Just to reiterate, this violates the assumptions under which you get your p-values.

I want an A/B testing tool that won't let you see results until it's done.


Interesting

This suggests to me that, similarly to a lot of algorithms, you might want to change your parameters during training.

So start with MAB-100 (RAND) and then decrease that % over time
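That annealing idea is easy to sketch. Here's a minimal version (the hyperbolic decay schedule and its constants are my own choices, not from the article):

```python
import random

def epsilon(t, eps0=1.0, decay=0.01):
    """Exploration rate that starts at eps0 (pure RAND) and anneals toward 0."""
    return eps0 / (1.0 + decay * t)

def choose(values, t):
    """values[i] is the observed conversion rate of arm i; t is the trial count."""
    if random.random() < epsilon(t):
        return random.randrange(len(values))                 # explore
    return max(range(len(values)), key=lambda i: values[i])  # exploit
```

With these defaults you're at 50% exploration after 100 trials and under 10% after 1,000, so early data is gathered evenly and later traffic mostly exploits.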


The counterargument would be: in the long run, random positive changes in clicks don't pay the bills. Systemic changes do.


Yes, and you don't get systemic change with A/B testing.

I find it funny when some people think A/B (or MAB) testing will solve major usability problems

You can definitely try Genetic Programming your website to conversion, it's probably going to be fun to watch


    10% of the time, we choose a lever at random. The
    other 90% of the time, we choose the lever that has
    the highest expectation of rewards. 
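In code, that quoted strategy is roughly the following (a sketch; the arm names and the 1-show/1-win starting counts are my own choices, the latter to avoid dividing by zero on fresh arms):

```python
import random

arms = {"green":  {"shows": 1, "wins": 1},   # start at 1/1 so ratios are defined
        "yellow": {"shows": 1, "wins": 1}}

def choose_arm(eps=0.10):
    if random.random() < eps:                # 10% of the time: random lever
        return random.choice(list(arms))
    # Otherwise: the lever with the highest observed win ratio.
    return max(arms, key=lambda a: arms[a]["wins"] / arms[a]["shows"])

def record(arm, converted):
    arms[arm]["shows"] += 1
    arms[arm]["wins"] += int(converted)
```
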
There is a problem with strategies that change the distribution over time: Other factors change over time too.

For example, let's say that over time the percentage of your traffic that comes from search engines increases, and this traffic converts better than your other traffic. And let's say at the same time, your yellow button gets picked more often than your green button.

This will make it look like the yellow button performs better than it actually does, because it got more views during a time when the traffic was better.

This can drive your website in the wrong direction. If the yellow button performs better at first just by chance, then it will be displayed more and more. If at the same time the quality of your traffic improves, that makes it look like the yellow button is better, while in reality it might be worse.

In the end, the results of these kinds of adaptive strategies are almost impossible to interpret.
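To see how this can happen, here's a toy simulation (the conversion rates, the mid-test traffic improvement, and the schedule are all my own assumptions, chosen only to illustrate the confound):

```python
import random

def simulate(n=20000, seed=1):
    """Epsilon-greedy where traffic quality doubles halfway through the test.
    Whichever arm happens to dominate during the good half soaks up that
    period's conversions, so the observed ratios mix two regimes unevenly."""
    random.seed(seed)
    shows = {"yellow": 1, "green": 1}
    wins  = {"yellow": 1, "green": 1}
    for t in range(n):
        base = 0.05 if t < n // 2 else 0.10            # traffic improves mid-test
        true_rate = {"yellow": 0.9 * base, "green": base}  # green is truly better
        if random.random() < 0.10:
            arm = random.choice(["yellow", "green"])   # explore
        else:
            arm = max(shows, key=lambda a: wins[a] / shows[a])  # exploit
        shows[arm] += 1
        wins[arm] += random.random() < true_rate[arm]
    return {a: wins[a] / shows[a] for a in shows}
```

Depending on the seed, the truly worse yellow arm can come out of this with the better-looking observed ratio, which is exactly the interpretation problem described above.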


I don't know if this is the case, if I understand the algorithm correctly. Say yellow has a 50% success rate, and green has a 65% success rate after the behavior change but 30% before it.

By sending 90% of traffic towards yellow, its ratio will normalize towards the 50% once it has enough traffic. By sending 10% of traffic randomly, eventually the green option will reach 51% and start taking the majority of traffic, which will then cause it to normalize at its 65% and be shown to a majority of users.

I think the problem might be, if you run this at a sufficiently high volume or for a long period of time, that if a behaviour change takes place it will take a long time to learn the new behaviour. Or, if two options aren't actually different, it may continually flip back and forth between them.

Also, to me, the concept of A/B testing certain things may have undesired consequences. For example, I order from Amazon every day, but today the buy button is blue; what does that actually mean? And I go back to the site later and it's yellow again. There are still many people who get confused by seemingly innocuous changes in the way their computer interacts with them.


> Also, to me, the concept of A/B testing certain things may have undesired consequences. For example, I order from Amazon every day, but today the buy button is blue; what does that actually mean? And I go back to the site later and it's yellow again. There are still many people who get confused by seemingly innocuous changes in the way their computer interacts with them.

Proper A/B tests are supposed to be done on a per-unique-user basis. If you access the shop from the same device or user account, a well-done A/B test should consistently show you the same interface.
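A common way to get that consistency without storing any per-user state is to hash the user ID together with an experiment name (a sketch; the salt format and arm labels are arbitrary choices of mine):

```python
import hashlib

def assign(user_id, experiment, arms=("A", "B")):
    """Deterministic assignment: the same user always lands in the same arm
    for a given experiment, and different experiments hash independently."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]
```

Because SHA-256 output is effectively uniform, each experiment gets a fresh, roughly 50/50 split without any shared database of assignments.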


I agree. Sorry, in this context I was referring specifically to the "20 lines of code" article, which I don't believe had this control.


Until that particular test ends, and another one begins.


This is why you should segment traffic and run separate tests for each segment, whether you're using an A/B testing or a multi-armed bandit algorithm.


Indeed, otherwise you might be guilty of Simpson's paradox either way.


It's possible to weight conversions by frecency to account for this, rather than using frequency alone.


Earlier discussion (989 points, 1407 days ago) https://news.ycombinator.com/item?id=4040022


I really wish Hacker News would show dates in days, months, and years rather than just days, or at least put the original submission date in the title attribute of "1407 days ago".

Submission date is May 30, 2012.


This is a good overview of the multi-armed bandit problem [1], but the author is far too dismissive of A/B testing.

First of all, the suggested approach isn't always practical. Imagine that you are testing an overhaul of your website. Do you want daily individual visitors to keep flipping back and forth as the probabilities change? I'm not sure if the author is really suggesting his approach would be a better way to run drug trials, but that's clearly ridiculous. You have to recruit a set of people to participate in the study, and then you obviously can't change what drug you're giving them during the course of experiment!

Second, it ignores the time value of completing an experiment earlier. In the exploration/exploitation tradeoff, sometimes short-term exploitation isn't nearly as valuable as wrapping up an experiment so that your team can move on to new experiments (e.g., shutting down the old website in the previous example). If a company expects to have a long lifetime, then over a time frame measured in weeks, exploration will likely be relatively far more valuable.

[1] https://en.wikipedia.org/wiki/Multi-armed_bandit


Regarding your first point, I'm not sure if the author covered it, but in Google Content Experiments with the multi-armed bandit approach, cookies are stored, so a user who sees variation B will keep seeing B while the experiment is running.


This is the approach every good A/B testing service/framework uses.


It is really funny how communities don't talk.

For instance, A/B testing with a 50-50 split has been baked into "business rules" frameworks for about as long as the multi-armed bandit has been around, but nobody in that community has ever heard of the multi-armed bandit. Meanwhile, machine learning people are celebrating the performance of NLP systems they build that are far worse than the rule-based systems people were using in industry and government 15 years ago.


Which NLP systems are far worse than which rule-based systems?

The statement is odd for two reasons. One is that plenty of NLP is rule-based; the other is that NLP isn't a form of A/B testing, which is the overall topic here.


> in the meantime, machine learning people are celebrating about the performance of NLP systems they build that are far worse than rule-based systems people were using in industry and government 15 years ago

That has not been my experience. I have been researching chat bots, and it would seem I can hardly find one implemented with machine learning; instead almost all are rule-based. I was quite disappointed. ML for NLP is just gearing up.


I like the premise of this a lot, but it seems to me that the setting that the author chose (some UI element of a website) is one of the worst possible settings for this: what matters a whole lot more than if your button is red or green or blue is some modicum of consistency.

If you're constantly changing the button color, size, location, whatever... that is an awful experience in and of itself, is it not? If the Amazon "buy now" button changed size / shape / position / color every time I went to buy something, I would get frustrated with it pretty quickly.


One aspect of testing they leave unsaid is that you identify your users (by cookie, most commonly) to make sure each user always gets the same experience. That's why your numbers are all based on unique users, not raw visits.

Their experience will still change once their cookie expires, but that amount of time is completely under your control.


If you cookie the users and make sure the experience they saw is persistent, that solves most of this problem. But if you run a lot of separate tests, then it's hard to avoid this.


Interesting, but since the writing of this article (2012), Google does offer Multi-Armed Bandit Content Experiments: https://support.google.com/analytics/answer/2844870?hl=en


There is at least one caveat with multi-armed bandit testing. It assumes that the site/app remains constant over the entire experiment. This is often not the case or feasible, especially for websites with large teams deploying constantly.

When your site is constantly changing in other ways, dynamically changing odds can cause a skew because you could give more of A than B during a dependent change, so you have to normalize for that somehow. A/B testing doesn't have this issue because the odds are constant over time.


When thinking about which type of approach is best, first think about the nature of the problem. Is it a real optimization problem, IOW are you more concerned with learning an optimal controller for your marketing application? If so, then ask:

1) Perishability: is the problem/information perishable? Perishable: picking headlines for news articles. Not perishable: a site redesign. If perishable, a bandit might give you real returns.

2) Complexity: are you using covariates (contextual bandits, reinforcement learning with function approximation)? If you are, then you might want your targeting model to serve up the best predicted options in subspaces (frequent user types) where it has more experience, and to explore more in less frequently visited areas (less common user types).

3) Scale/automation: you have tons of transactional decision problems, and it just doesn't scale to have people running many A/B tests.

Often it is a mix - you might use a bandit approach with your predictive targeting, but you should also A/B test the impact of your targeting model vs. a current default and/or a random draw. See slides 59-65: http://www.slideshare.net/mgershoff/predictive-analytics-bro...

For a quick bandit overview check out: http://www.slideshare.net/mgershoff/conductrics-bandit-basic...


Off-topic, but why do I have to enable JavaScript to even see anything?


That is really weird. Technically the content is there all along (so it's not loaded in by JavaScript) but you still have to have JavaScript enabled for it to render. Who designed that!?

Edit: hahaha what. It appears the content is laid out with JavaScript. So basically they're using JavaScript as a more dynamic CSS. Let that sink in. They're using JavaScript as CSS.

It sorta-kinda makes sense for the fancy stream of comments but still... why is it a requirement!?


Very funny. They have a div #main with visibility:hidden. Removing that rule from dev tools displays the full page without enabling JS. The comments block is a solved problem with CSS. Basically any grid layout toolkit does that as their very first demo.


Yeah, if you are going to do that just for a fade-in transition, at least set it to hidden with an inline script block; then someone with scripts turned off still gets the content. You block rendering until everything is downloaded, but then so does your needless transition.

If you are worried about it breaking in some browsers because you are modifying a tag whose content is not yet complete, so it isn't accessible in the DOM yet, or because you use a DOM manipulation library that hasn't loaded yet due to lazy loading, have the script add an extra wrapper instead of modifying the existing tag [i.e. <script>document.write('<div id="iamanidiot" style="visibility:hidden">');</script> directly after <body>, and <script>document.write('</div>');</script> before </body>]. Or, of course, just don't...


> Or, of course, just don't...

I think that's really the takeaway here. Bytes and bytes of JavaScript, slower rendering time, and for what? Just say no!


Apparently some sort of useless fade-in effect is more important than the page content. This is the fix:

    #main {
        visibility: visible !important;
    }


Their trick also keeps noscript weirdos out though, which is a plus.


No, it didn't. I'm not about to run random Javascript, and the page was visible within ~3 clicks.

Pages that are actually broken without JavaScript only indicate that the author was lazy and unprofessional. Progressive enhancement is not hard.


>Pages that are actually broken without JavaScript only indicate that the author was lazy and unprofessional.

Or that they have kept with the times and understand the notion of opportunity cost.

>Progressive enhancement is not hard.

No, just redundant and for marginal benefit. The train has long left that station.

Search Engines (well, those that matter) and screen readers for a11y both work with JS.


> Or that they have kept with the times and understand the notion of opportunity cost.

That's not keeping up with times, that's poisoning the well because others do it too. By making such choices a site author is making the Internet worse for everyone.

I'm starting to think there should be some kind of "how not to be an ass on the Internet" course that would be mandatory before you're allowed to do business on-line.


Actually, weirdos are people making such pages. There is no valid reason for the content here to be not visible without JavaScript on. It's laziness and/or stupidity.


In Firefox, "View", "Page Style", "No Style". I assume other browsers let you do the same thing. This works for many pages that fail to render without JavaScript.


That's actually not off-topic. I wouldn't take web programming advice from someone who thinks displaying a blog article requires javascript.


The same reason you have to have electricity to watch TV.

It's a required part of modern web sites -- and it doesn't matter whether it's "really needed" for any particular site or not.


Bandits are great, but using the theory correctly can be difficult (and if it is accidentally applied incorrectly, one's results can easily become pathologically bad). For instance, the standard stochastic setup requires that learning instances are presented in an iid manner. This may not be true for website visitors: for example, different behaviour at different times of day (browsing vs. executing), or timezone-driven differences in cultural responses. There is never a simple, magic solution for these things.


So I googled A/B testing vs Multi-Armed Bandit, and ran into an article that's a useful and informative response to the OP: https://vwo.com/blog/multi-armed-bandit-algorithm/

edit: Ah, 'ReadingInBed beat me to it. tl;dr: Bandit approaches might not _always_ be the best, and they tend to take longer to reach a statistically significant result.


I remember reading this post back in 2012. I ended up getting a copy of the Bandit Algorithms book by John Myles White.

It's short, but it covers all the major variations.


One of the best talks I've ever seen was Aaron Swartz talking about how Victory Kit used multi-armed bandits to optimize political action: https://github.com/victorykit/whiplash


Just a warning, this isn't a magic bullet to replace all A/B testing. This is great for code that has instant feedback and/or the user will only see once, but for things where the feedback loop is longer or the change is more obvious or longer lasting (like a totally different UI experience), it doesn't work so well.

For example, if your metric of success is that someone retains their monthly membership to your site, it will take a month before you start getting any data at all. At that point, in theory almost all of your users should already be allocated to a test because hopefully they visited (and used) their monthly subscription at least once. So it would be a really bad experience to suddenly reallocate them to another test each month.


This presumes a few things about the decision being tested, many of which aren't always true.

I ran a few basic A/B tests on some handscanner software used in large warehouses. The basic premise is that the user is being directed where to go and what items to collect. The customer wanted to know whether changes to the font size and colour of certain text would improve overall user efficiency. But the caveat was that we had to present a consistent experience to the user - you can't change the font size every 10 seconds, or it will definitely degrade the user experience!

My point is that it sounds as though the multi-armed bandit will probably work great provided the test is short and simple, and the choice can be re-made often without impacting users.


It's possible to assign a user to a category once a week, and use the same algorithm, rebalancing users between the categories as needed once a week.


Reinforcement approaches are certainly interesting, but one of the things missing here (and in most A/B stuff) is statistical significance and experimental power. If you have enough data, there are hand-wavy arguments that this will eventually be right, but in the meanwhile, if there is some opportunity cost (say, imagine this is a trading algo trying to profit from bid/ask), you've screwed yourself out of some unknown amount of profits. There are actually ways of hanging a confidence interval on this approach which virtually nobody outside the signal processing and information theory communities knows about. Kind of a shame.


Any chance you could point me to a reference? I'm doing research in this space and currently working on a paper which does exactly this for diagnostics of testing processes.


I'm not sure which thing I mentioned you need a reference on. For p-values on machine learning techniques: http://vovk.net/cp/index.html. I'll eventually do a blog post on this subject; it's very good math that all ML people should know about, though Vovk, Shafer, and Gammerman write pretty dense articles.

For statistical power... "Statistical Power Analysis for the Behavioral Sciences" by Jacob Cohen.


Sorry, I intended to quote your last sentence, applying a confidence interval to a reinforcement learning system, especially with respect to multi-armed bandits / adaptive experiments, but if I have to dig in to some signal processing stuff I am happy to do that.

It seems the conformal prediction link has some relation. I will dig, thanks.


Yes, conformal prediction does exactly this.


A side note: I love the website UX/Design. Especially the flowing comments layout.


Author is confusing the map and the territory.

A/B testing is comparing the performance of alternatives. Epsilon-greedy is a (good) way to do that. It's better than the most common approach, but you're still testing A against B.


A post on why the bandit is not better: https://vwo.com/blog/multi-armed-bandit-algorithm/


This is only appropriate in certain situations. There are many business situations in which it's more appropriate to run a traditional A/B test and carefully examine and understand the results before making a business decision. Always blindly accepting the results of a bandit is going to explode in your face at some point.

There is no silver bullet, no free lunch. There is no algorithm that will beat understanding your domain and carefully analyzing your data.


Thought this was interesting, so I made it in C#. I could've optimised a bit more in the Choose function but regardless, this is super cool!

https://gist.github.com/Zyst/0da505007b0e8c21418247000f3e7d4...


What I love about this example is that it's the same algorithm applied in two completely different areas. This algorithm (or variants of it) can be used in place of A/B testing. The same algorithm can also be applied to game playing, where you get Monte Carlo tree search, the basis of AlphaGo.


A/B Tasty (the most-used platform in France, I think; not affiliated) uses this approach:

http://blog.abtasty.com/en/clever-stats-finally-statistics-s...


"Like many techniques in machine learning, the simplest strategy is hard to beat." is a thoroughly ridiculous statement.

It should instead say "Like many techniques in machine learning, the simplest strategy is the easiest to implement", as the title of the post (20 lines) makes clear.


That's not thoroughly ridiculous.

For many problems in machine learning, k-nearest-neighbors and a large dataset is very hard to beat in terms of error rate. Of course, the time to run a query is beyond atrocious, so other models are favored even if kNN has a lower error rate.
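As a concrete illustration of "simple but slow" (a toy sketch, not a claim about any particular benchmark): a brute-force kNN classifier is only a few lines, but every prediction scans the entire training set, which is where the atrocious query time comes from.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs.
    Brute force: O(len(train)) distance computations per query."""
    sq_dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

In practice you'd reach for approximate nearest-neighbor indexes to make queries tractable at scale, at some cost in error rate.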


According to [1], k-NN is pretty far from being the top general classifier. Admittedly, all the data sets are small, with no more than 130,000 instances. When does it start becoming "very hard to beat", and what are you basing that on?

1. http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf


The vast majority of the data sets are very small (on the order of 100 to 1000 examples!). In fact, in the paper they discarded some of the larger UCI datasets.

It's not surprising that they found random forests perform better, even though the conventional wisdom is boosting outperforms (it's much harder to overfit with forests).


Not just easiest to implement -- literally "hard to beat".

Not "impossible to beat" mind you -- merely hard. Because the simplest strategies work so well in so many cases, that you need to go quite subtle and implement harder algorithms (doesn't matter if you just take them ready off of a lib for the argument) to see an improvement. 80-20, et al.


I've been using bandits in my mobile apps for the last 3 years and am now packaging them as a saas REST API. Message me if you want an early look.


Hey, I was working on the same thing. Wanna talk about it ?


Multivariate A/B testing lets you test combinations of multiple changes simultaneously.


There is a startup (unlaunched) built around this approach: http://www.growthgiant.com


Why 10%? Sounds like a magic constant.


I believe that's the greed factor: 90% of the time you are using the winning version of the page/button/whatnot.


Isn't this how A/B testing works with Google Analytics?


I digress, but this site has an interesting comments layout.


John Langford and the team (Microsoft Research) have built a contextual bandit library in Vowpal Wabbit.

It can be used for everything from active learning to changing webpage layouts to increase ad clicks. It has the best bounds out of all exploratory algorithms.

Structured contextual bandits that come with LOLS (another algorithm present in vowpal wabbit) is extremely powerful.

All for free under BSD3.



