I like the epsilon-greedy algorithm because it's simple to understand and implement, and easy to extend. However, the claim that "The strategy that has been shown to win out time after time in practical problems is the epsilon-greedy method" is false. The standard measure of performance is called regret. You can think of it as the number of times you choose the sub-optimal choice. It is clear that this grows linearly with e-greedy, as there is a constant probability of exploring. The same is true in A/B testing (you show half the people the suboptimal choice in the data-gathering phase and then make a decision that you have some probability of getting wrong). A good bandit algorithm has regret that grows logarithmically with time, which is a huge difference! This result holds up in practice as well. If you look at some of Yahoo's papers (John Langford, for example; sorry, no links as I'm writing this while getting the kids ready!) you'll see comparisons to e-greedy where they significantly out-perform it. We've had the same results in our testing.
To that end, btw, I think a service like yours is potentially quite valuable!
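For reference, the epsilon-greedy baseline being discussed fits in a few lines. This is only a minimal sketch (the class name and structure are mine, not from any particular library): it explores a random arm with probability epsilon and otherwise exploits the best observed success rate. You can see why regret grows linearly: that epsilon slice of exploration traffic never shrinks.

```python
import random

class EpsilonGreedy:
    """Minimal epsilon-greedy bandit: explore a random arm with
    probability epsilon, otherwise exploit the arm with the best
    observed success rate."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.successes = [0] * n_arms
        self.trials = [0] * n_arms

    def choose(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.trials))  # explore
        # exploit: pick the highest observed rate (untried arms count as 0)
        rates = [s / t if t else 0.0
                 for s, t in zip(self.successes, self.trials)]
        return rates.index(max(rates))

    def update(self, arm, reward):
        self.trials[arm] += 1
        self.successes[arm] += reward
```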
We collect a variety of metrics. When we do an A/B test, we look at all of them as a way of understanding what effect our change has on user behavior and long-term outcomes.
A particular change may be intended to affect just one metric, but that's in an all-else-equal way. It's not often the case that our changes affect only one metric. And that's great, because that gives us hints as to what our next test should be.
+1 on an article does not mean "I agree". It means "I learnt something".
The problem with this article is that it's a) not very detailed, and b) the conclusion is linkbait. Bandit optimization is a useful tool, but it has drawbacks, and it's not always better than A/B testing. In particular, bandit approaches take longer to converge (on average), and don't give you reliable ways to know when to stop testing (when all you know is that you're using an approach that's optimal in the limit of a large N, your only guarantee is that things get better as N gets large). These techniques also make assumptions that aren't valid for a lot of web experiments: identical "bandit" distributions that are constant over time. Throw a few choices that are optimal at different times of day/week/month/year at a bandit optimizer, and it'll just happily fluctuate between them.
Also, there's a lot of variation in performance depending on the parameters of your test -- some of which are completely unknowable. So if you want to really learn about this method, you need to read more than a blog post where the author has concluded that bandit optimization is the new pink. For example, here's a pretty readable paper that does an empirical analysis of the various popular bandit algorithms in different parameterizations:
This is just one article, but there's tons of literature on this problem. (In fact, if you use the 'softmax' criterion mentioned in that article, you're doing something very similar to simulated annealing, which is a rather elderly optimization technique.)
This type of error (which can happen very easily on a website going through continuous improvement) can take a very long time to recover from.
A/B tests do not have an issue with this because all versions will have similar mixes of data from before and after the improvement.
Better yet, make x dependent in some way on time since the last change, and relative change in performance of all options from before and after the change.
My problem with the bandit method is that I want to show the same test choice to the same person every time he sees the page so you can hide that there is a test. If I do this with the bandit algo then it warps the results because different cohorts have different weightings of the choices and differing cohorts behave very differently for lots of reasons.
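For a plain A/B test, per-user consistency is easy to get with deterministic hashing (a sketch; the function name is mine). With a bandit, though, the arm weights drift over time, so you'd instead have to record each user's first assignment, which is exactly what creates the cohort skew described above.

```python
import hashlib

def assign_variant(user_id, variants):
    """Deterministically map a user id to a variant by hashing it,
    so the same user always sees the same version (plain A/B split).
    Note this only works when the split is fixed, not bandit-weighted."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return variants[h % len(variants)]
```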
A forgetting factor will help.
This is a variant of the cohort issue that you're talking about.
The cohort issue that you're talking about raises another interesting problem. If you have a population of active users, and you want to test per user, you often will find that your test population ramps up very quickly until most active users are in, and then slows down. The window where most users are assigned is a period where you have poor data (you have not collected for long, users have not necessarily had time to go to final sale).
It seems to me that if you want to use a bandit method in this scenario, you'd be strongly advised to make your fundamental unit the impression, and not the user. But then you can't hide the fact that the test is going on. Whether or not this is acceptable is a business problem, and the answer is not always going to be yes.
Absolutely. That's my biggest objection to bandit methods, but it's also the fuzziest objection, and the one least likely to appeal to hyper-analytical people. There's a strong temptation (as we can see from this article) to treat bandit optimization as a black box that just spits out an infallible answer (i.e. as a Lazy Button).
It's the same human tendency that has led to "you should follow me on twitter" to be one of the more common n-grams on the interwebs (even though it probably never worked for more than Dustin Curtis, and likely causes a counter-intuitive backlash now).
Of course, these algorithms are cool in certain contexts where you never want to think about it again. For example, ad placement for a giant ad network running against a lot of UGC content. In that way they remind me of neural networks: the post office doesn't care how the machine recognizes hand-written zip codes, just as long as it does so reliably.
But startups are all about learning, and interface design is where you have direct contact with users. I want to soak in the data. I want to hang out in people's living rooms when they're using my site. A fancy Lazy Button (great phrase) is the last thing I need.
The A/B test just gives you the conversion rate of each option. And so does the bandit. As I understand, the only difference is that the bandit will be bugging your users with bad results less often.
Of course, the real answer to why not is "we have an A/B system and I'm not going to add bandit stuff for no benefit". But even if I were doing it from scratch, it seems more complex. The benefit of these approaches seems to be that one no longer has to think. We want to think.
No, because the assumptions that underpin many statistical techniques are violated when you're not assigning people to cohorts consistently, and at random.
The standard confidence tests -- t-tests, G-tests, chi-squared tests, etc. -- are based on distributions of independent, identically distributed (iid) data.
I'd have to think about it more, but I believe that btilly's examples are also the most intuitive reasons why independence matters. If your data is time-dependent, then assigning users to cohorts based on past performance lets the time dependency dominate. There may be other good examples.
Stopping a test when you reach a "statistically significant" result is the wrong way to do A/B testing. In both multi-armed bandit and A/B testing you need to set ahead of time the number of users you are going to run your test against, and stop the test at that point regardless of whether your result is significant or not.
See http://elem.com/~btilly/effective-ab-testing/index.html#asli... for part of a presentation that I did where I actually set up some reasonable fake tests, and ran simulations. What I found is that if there is a significant difference, the probability of coming to the wrong conclusion was (as you would expect) higher, but not that high before the underlying difference made mistakes incredibly unlikely. Conversely, if there is only a small real difference, the amount of data needed before you have a significant chance of having accidentally come to an erroneous conclusion is very, very large.
So avoid accepting any result where you don't have at least a few hundred successes and set your thresholds reasonably high. You will make fewer mistakes than you probably fear, and the ones that you make will almost always be very minor. (Oops, I had a 3% chance of accepting the 1% worse solution as probably better.)
Of course if you're aiming to publish academic research, your standards need to be higher. But if you're more interested in getting useful results than publishable ones, you can relax your standards. A lot.
Nobody said that it was. But when you do regular split testing, you can use power analysis to estimate the length of time you need to run an experiment to get a significant result at a certain precision:
You can't do this (at least, not easily) when you're using bandit models, because none of the assumptions are valid.
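The power analysis mentioned above can be sketched with the standard normal-approximation formula for comparing two proportions (a sketch using only the Python standard library; the function name is mine):

```python
from math import sqrt
from statistics import NormalDist

def sample_size_per_arm(p1, p2, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect a difference between
    conversion rates p1 and p2 with a two-sided two-proportion z-test
    at significance alpha and the given power."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return num / (p1 - p2) ** 2
```

For example, detecting a lift from 5% to 6% conversion at 80% power needs roughly 8,000+ users per arm, which is why test length matters so much at moderate traffic volumes.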
But I usually upvote stories based on how interesting they are and whether I feel like they were worth attention. Comments are a mixed bag of agreeing, interesting, and just well-thought-out arguments that make me think.
This is completely meta and kind of offtopic, so it should probably be downvoted, but I felt like saying it anyway.
As in meat-space voting, you should always reward good behavior.. be part of the fitness function.
1. Real world performance varies over time. For instance there are typically daily, weekly and monthly conversion rate fluctuations. Not an issue for A/B testing, but a big issue for this approach if a random switch in direction happens at the same time that conversion fluctuations happen to head in a good direction.
2. (This is really a special case of #1 - but a very, very important special case.) This approach creates long-lasting interaction effects between tests and independent changes. That requires explanation. Suppose you're running a test. Version A is somewhat better. But version B is temporarily looking slightly better when you make a significant improvement to your website (maybe you started another test that works). Now you're adding a lot of good traffic to version B (good because of the other change) and very little of the new good traffic to version A. This new version B traffic soundly beats your old version A traffic. This correlation between time and website performance will continue until the old version A traffic is completely swamped by new version A traffic. With only 5% of your traffic going to version A, this can easily take 100x as long as your test has been running - or more. (Properly constructed A/B tests do not suffer this statistical anomaly.)
3. Code over time gets messy. One of the most important characteristics of A/B testing is that you can delete the mess and move on. With this approach you can't - it just hangs around adding to your technical debt.
4. Businesses are complex, and often have multiple measures they would like to balance. For instance in a recent test, conversion to click was hurt, conversion to a person who clicked 5x was helped. A/B testing let us notice that something weird was going on and think about what we really cared about. This automated approach would make a decision and could have hidden a real problem.
5. Many tests perform differently on existing users and new users. A/B testing with proper cohort analysis can let you tease this out and decide accordingly. This approach doesn't give you that kind of sophistication.
Firstly, neither you nor the author is really citing any maths or real-world papers or anything. He backs his up by saying that all the advertisers are using this over A/B testing, which is a pretty strong argument. But it occurs to me that for most of your points to stand you need to tackle this particular paragraph:
Like many techniques in machine learning, the simplest strategy is hard to beat. More complicated techniques are worth considering, but they may eke out only a few hundredths of a percentage point of performance. The strategy that has been shown to win out time after time in practical problems is the epsilon-greedy method.
So to tackle your points:
1. Only stands if you show his paragraph above to be wrong. Does epsilon-greedy only work on consistent payouts, or does it work on fluctuating payouts too? It would seem to me that this would be a common occurrence in advertising on websites. I imagine there is some research out there to settle this!
2. He addresses this directly in the post: "This won't adapt to change. (Your visitors probably don't change. But if you really want to, in the reward function, multiply the old reward value by a forgetting factor)"
3. There is no difference between this and A/B testing: the mock code he shows is supposed to go in your A/B testing framework, and the code in the controllers is supposed to be the same (so you can remove it the same way).
4. Isn't A/B testing just as bad at testing multiple factors? Why wouldn't you 'notice'? You should theoretically see the same percentages for each stage, and would be able to notice the oddity.
5. Again this only stands if you show his paragraph above to be wrong. You are suggesting that a complicated strategy will win, which he says isn't true.
The difference in this case is not a few hundredths of a point of convergence. It is a question of potentially drawing wrong conclusions, and favoring the wrong version, for 100x as long as you need to.
If you're really concerned about rapidly changing events, just add a diminishing return, i.e. multiply both the success and failure numbers by, say, 0.9999 after each test.
So 34/2760 becomes 34.9966/2760.724 on a success, or 33.9966/2760.724 on a failure.
As to the diminishing factor: you diminish both the numerator and the denominator for a bucket every time you test that bucket. If you want something next to perfect, try http://en.wikipedia.org/wiki/Bayesian_statistics, but that eats a lot of CPU and is harder to code for minimal gain.
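As a sketch, the forgetting-factor update from the 34/2760 example above is just (the function name is mine):

```python
def decayed_update(successes, trials, reward, decay=0.9999):
    """Apply a forgetting factor before recording a new observation,
    so old data gradually loses weight relative to recent data."""
    successes = successes * decay + reward  # reward is 1 or 0
    trials = trials * decay + 1
    return successes, trials
```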
mute == "does not talk"
moot == "of little or no practical value or meaning; purely academic."
moot/mute is a tricky word combo because smart people can justify using "MUTE" instead of moot, and English is such a terrible language w.r.t. spelling that there are no good clues for using one versus the other (and "moot" is far less commonly used).
I've seen minor tweaks to a form raise conversions by 20%. The person running a particular test may not even know that a particular change was significant, or even that it happened. The change could be as subtle as another running test realized that version x is better, interfering with existing tests.
As for #5, when you run into these situations with an A/B test, you're able to break down and crunch the numbers in multiple ways, and then have a discussion about how you want to proceed. But with a multi-armed bandit approach, whatever complexities you have not thought of and baked into your approach are not going to be noticed.
One approach would be to ignore the data from the 90% exploitation; that way, you only get 10% of the data, but its slice assignment is completely random and uncorrelated with anything else that might be happening. The trouble is that now you're running an A/B/... test on only 10% of your traffic, which means that it will converge 10x slower than if you were running it on 100% of your traffic.
However, it seems to me that the extra 90% of data that I've proposed ignoring isn't that useful, because it's only coming from one slice at a time. What you really want is to get more data from the slices you know least about. I suspect there are reinforcement learning algorithms that take into account not just the reward rate for each slice, but the current level of certainty with which the algorithm knows the reward rate, so it can collect more data about the slices it knows the least about, and stop collecting data about the slices for which it already has a fairly accurate reward estimate. The question is, are there such algorithms that can also handle non-stationary reward distributions? And how much tuning and tweaking do they require?
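Such algorithms do exist. UCB1 is the textbook example of tracking per-arm uncertainty: each arm's score is its observed rate plus a bonus that shrinks as that arm accumulates data, so under-explored arms get pulled until their estimates firm up. In its plain form it still assumes stationary rewards (discounted and sliding-window UCB variants relax that). A minimal sketch, with the function name mine:

```python
from math import log, sqrt

def ucb1_choose(successes, trials):
    """UCB1: pick the arm maximizing observed rate plus an uncertainty
    bonus, so poorly-explored arms keep getting tried until their
    estimates are accurate."""
    # try each arm once before trusting any estimate
    for arm, t in enumerate(trials):
        if t == 0:
            return arm
    total = sum(trials)
    scores = [s / t + sqrt(2 * log(total) / t)
              for s, t in zip(successes, trials)]
    return scores.index(max(scores))
```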
1. Real world performance varies over time...
I don't really understand this complaint. With A/B testing, you have to collect enough data before making a decision that the conversion rate fluctuations average out. If you're worried about monthly fluctuations (which is reasonable), and you're doing A/B testing, you need to collect data for a few months. This means months where you aren't optimising. With a bandit algorithm conversion rate fluctuations will cause changes in the bandit behaviour, but they will average out over time just like A/B testing. We can prove this, so I don't know what justification there is for the statement that it is "a big issue for this approach". Indeed I think this points out a weakness of A/B approaches. Firstly, it's arguable that tracking fluctuations in conversion rate is a good thing. They may be temporary, but why shouldn't the algorithm react to temporary changes? Secondly, unlike A/B testing you're not spending months collecting data -- you're actually optimising while you collect data. Finally, what site can afford to spend months not changing, while data is collected?
2. (This is really a special case of #1 - but a very, very important special case...
I see two issues here: 1) the fact the traffic has changed and 2) how long does it take the system to recognise this change?
Addressing 1: Getting technical for a moment, A/B tests assume the distribution generating the data is stationary. This situation violates this assumption, so one shouldn't be using A/B testing here. I'd like to know what is meant by a properly constructed A/B test in this case. Does it mean throwing out all the data before the significant improvement? If you accept that a non-stationary distribution is fine, then you can develop bandit algorithms for non-stationary situations (e.g. http://arxiv.org/abs/0805.3415). Myna handles non-stationary situations, so long as they change quite slowly.
Point 2 is really about convergence rates. There is some work here (http://imagine.enpc.fr/~audibert/Mes%20articles/ALT11.pdf) and the upshot is that convergence rate can be a problem in theory though you can design for it. Notably we haven't observed this issue in practice.
3. Code over time gets messy...
This is a non-issue in my experience. At some point you decide that an experiment is no longer worthwhile. Either you're redesigning things so it doesn't make sense anymore, or the experiment has been running long enough that you are confident of no more changes. When you hit this point you remove the code. It can be a year or an hour after starting the experiment. Either way, you know with a bandit algorithm you will have optimised what traffic you did receive.
4. Businesses are complex, and often have multiple measures they would like to balance...
The glib answer to this is to set your reward criteria correctly, but more below.
5. Many tests perform differently on existing users and new users...
Both of these are really about performing extra analysis on the data. Myna doesn't help here, and we could make it better in this regard by, for example, exporting data. Some of this extra analysis, such as cohort analysis, we can and will automate in the future. There will always be analyses that we can't or don't perform, so there will always be jobs for people who perform these analyses :-) On the other hand, bandit algorithms are used enough in practice (e.g. Google uses them for AdWords) that their utility has been validated. It's a trade-off between time and return. We want to automate the common cases so the analysts can concentrate on the interesting parts.
Now to reply point by point.
With the assumption that I provided above, A/B testing merely needs to collect enough data to detect real differences in user behavior. After a few days you may not know what the average long-term rate is, but you have strong evidence that one is better.
I have absolutely no idea where you got the impression that A/B tests typically run for months, or that you can't touch anything else while it is running. How long they need to run varies wildly (depends on traffic volume, conversion rates, etc), but for many organizations a week is a fairly long test.
Contrary to your point, A/B tests DO NOT assume that the distribution is stationary. They make the much weaker assumption that if a relative difference is found, that difference will continue to be true later. (If this assumption is only true 95% of the time, it is still useful...)
When you're running multiple tests the time-dependence of performance is often extreme. For instance while running test 1, you introduce test 2, it wins, and you adopt it. Now the time period you're running test 1 has 2 conversion bumps of around 5% in the middle. This is not a problem for properly constructed A/B tests. But it is a serious violation of "Myna handles non-stationary situations, so long as they change quite slowly."
Now what is a properly constructed A/B test? A properly constructed A/B test is one where membership in any particular test version has only random correlations with any other factor that could materially affect performance. Such factors include time, and other running tests. If each user is randomly placed in a test version upon first encountering them, independently of anything else that happens, then you have a well-constructed A/B test. Your versions will be statistically apples/apples even though early and late users are apples/oranges.
The degree to which this will be seen as an issue depends on personal taste, and on whether you're the person who has to dive through templates with a maze of if-conditions. If you talk to management, it is almost never a (visible) problem. Programmers on the ground often disagree.
Getting the business to have the necessary discussions in the abstract is a non-starter in my experience. When you have a concrete example, decisions are easier to make.
Also (as happened in the test that I saw last week), often the discrepancy is a sign of another problem. A better reward criterion would have resulted in making a good pick, but looking at the discrepancy found an issue that wouldn't have been found otherwise, and we'll hopefully wind up with an even better option than we would otherwise have found.
My point is that the extra analyses are useful. However this is really a secondary or tertiary point since most companies doing A/B testing are not going this extra mile.
As for AdWords, Google has sufficient volume for a traditional A/B test to converge in minutes. They have so much volume that continuous adaptation can just happen for them. Most businesses are not in such a fortunate place.
could this be solved with an exponential decay?
aka, saying that a click from 1 month ago is less valuable than someone clicking today.
By tweaking the decay you could change how quickly the algorithm will sway when the conversion rate changes.
EDIT: I just saw that rauljara suggested this below: https://news.ycombinator.com/item?id=4040230
1. Every kind of lead gen that I have been involved with and thought to measure has large periodic fluctuations in user behavior. Measure it, people behave differently on Friday night and Monday morning.
2. If you're regularly running multiple tests at once, this should be a potential issue fairly frequently.
3. If you really fire and forget, then crud will accumulate. To get rid of that you have to do the same kind of manual evaluation that was supposed to be the downside of A/B testing.
4. Most people do not track multiple metrics on every A/B test. If so, you'll never see how it matters. I make that a standard practice, and regularly see it. (Most recently, last week. I am not at liberty to discuss details.)
5. I first noticed this with email tests. When you change the subject line, you give an artificial boost to existing users who are curious what this new email is. New users do not see the subject line as a change. This boost can easily last long enough for an A/B test to reach significance. I've seen enough bad changes look good because of this effect that I routinely look at cohort analysis.
That said, the people there are very smart and are doing something good. But I would be very cautious about time-dependent automatic optimization on a website that is undergoing rapid improvement at the same time.
Sarcasm aside, I've also experienced all of these issues with real world testing and would be interested in hearing your argument as to why you think this is not the case.
Most or all of the points suffer from:
* is that actually true?
* does regular a/b testing not also face that issue?
* was it suggested that you must "set it and forget it"?
* are there no mechanisms for mitigating these issues?
* would using 20% or 30% mitigate the issues?
* are you not allowed to follow the data closely with the bandit approach?
The whole list struck me as a supposed expert in the status quo pooh-poohing an easier approach.
Let's address them one by one.
is that actually true?
In every case, yes.
does regular a/b testing not also face that issue?
For the big ones, regular A/B testing does not face that issue. For the more complicated ones, A/B testing does face that issue and I know how to work around it. With a bandit approach I'm not sure I'd have noticed the issue.
was it suggested that you must "set it and forget it"?
Not "must", but it was highly recommended. See paragraph 4 of http://stevehanov.ca/blog/index.php?id=132 - look for the words in bold.
are there no mechanisms for mitigating these issues?
There are mechanisms for mitigating some of these issues. The blog does not address those. As soon as you go into them, you get more complicated. It stops being the "20 lines that always beats A/B testing" that the blog promised.
I was doing some back-of-the-envelope calculations on different methods of mitigating these problems. What I found was that in the best case you essentially wind up rebuilding something as complicated as A/B testing, without its guarantees.
would using 20% or 30% mitigate the issues?
That would lessen the issue that I gave, at the cost of permanently worse performance.
The permanent performance bit can benefit from an example. Suppose that there is a real 5% improvement. The blog's suggested approach would permanently assign 5% of traffic to the worse version, for 0.25% less improvement than you found.
Now suppose you tried a dozen things. 1/3 of them were 5% better, 1/3 were 5% worse, and 1/3 did not matter. The 10% bandit approach causes you to lose 0.25% conversion for each test with a difference, for a permanent roughly 2% drop in your conversion rate over actually making your decisions.
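For what it's worth, the arithmetic behind that roughly 2% figure checks out (a sketch assuming a 10% exploration rate split evenly across two versions, as in the parent comment):

```python
# Arithmetic behind the "permanent ~2% drop" claim, assuming a 10%
# exploration rate split evenly between two versions (5% each).
explore_share = 0.05        # traffic permanently stuck on the losing version
gap = 0.05                  # the losing version converts 5% worse
loss_per_test = explore_share * gap            # relative loss per live test
tests_with_difference = 8                      # 2/3 of a dozen tests
total_loss = loss_per_test * tests_with_difference
print(round(loss_per_test, 4), round(total_loss, 2))  # prints: 0.0025 0.02
```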
(Note, this is not a problem with all bandit strategies. There are known optimal approaches where the total testing penalty decreases over time. If the assumptions of a k-armed bandit hold, the average returns of the epsilon strategy will lose to "A/B test, then go with the winner", which in turn loses to more sophisticated bandit approaches. The question of interest is whether the assumptions of the bandit strategy really hold.)
Whichever form of testing you use, you're doing better than not testing. Most of the benefit just comes from actually doing testing. But the A/B testing approach here is not better by hundredths of a percent, it is about a permanent 2% margin. That's not insignificant to a business.
If you move from 10% to 20%, that permanent penalty doubles. You're trading off certain types of short-term errors for long-term errors.
(Again, this is just an artifact of the fact that an epsilon strategy is far from an optimal solution to the bandit problem.)
are you not allowed to follow the data closely with the bandit approach?
I am not sure what you mean here.
With all metrics, it's important to understand what's actually going into the measure and where it might get tripped up.
A potential solution might be to add a decay factor, so that the older data carries less weight.
The calculation is straightforward once you reduce everything to the scalar case (state-transition and observation terms set to identity):
P_pred = P0 + Q
K = P_pred / (P_pred + R)
x1 = x0 + K * (z - x0)
P1 = (1 - K) * P_pred
x0, P0 - previous score, previous covariance
Q - Roughly related to the age of the last measurement. Goes up with age.
R - Measurement error. Set it close to 0 if you are sure your measurements are always error-free.
z - the most recent measured value.
Let's say you measure number of clicks per 1000 impressions. Now you can estimate the expectation value (x1) for the next 1000. After the second 1000 re-estimate again.
x1 = x0 + alpha * (z - x0)
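The scalar update described above is only a few lines of code (a sketch; the function name is mine):

```python
def kalman_update(x0, P0, z, Q, R):
    """One scalar Kalman filter step: inflate the uncertainty by Q
    (which grows with the age of the last measurement), then blend in
    the new measurement z, weighted by the gain K."""
    P_pred = P0 + Q                  # predicted covariance
    K = P_pred / (P_pred + R)        # Kalman gain
    x1 = x0 + K * (z - x0)           # updated estimate
    P1 = (1 - K) * P_pred            # updated covariance
    return x1, P1
```

Note that with Q = 0 and fixed R, K settles toward a constant and this degenerates into the fixed-alpha exponential smoothing above.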
If "whatever design was most popular at that time" has a billion trials and 90% successes, and "a new superior design" has 100 trials and 97% successes (97 out of 100), then the new design is favored by the algorithm. No need to "catch up" to the absolute number of successes.
Say you have a sports site and test a new soccer-oriented layout vs an old baseball-heavy one. During the day, the old baseball version wins easily. When NA goes to sleep it would serve up baseball to the Europeans until it loses, then after several hours soccer is the winner. But then it is too late and the Europeans go to sleep, and on and on. This is an odd example and assumes equal balance, and the site should really be localized, but you get the point.
You can now evaluate the results conditioned on each group (american / non-american).
Epsilon-greedy does well on k-armed bandit problems, but in most applications you likely can do significantly better by customizing the strategy to individual users. That's a contextual bandit, and there are simple strategies that do pretty well here too. For instance:
I have 2 ads in an Ad Group, and the wording between the two is different by a single word. One ad had over double the clickthru rate than the other one, just because of that single word difference.
I noticed that Google was serving the two ads about 50% of the time, and was going to shut off the one ad that had the lower CTR, but then I let it go, and the next day, I saw that the more successful ad had almost all the views, and the less successful one was barely being served.
It's almost certainly impracticable, but fun to think about.
But truly someone has to design the learning system; at some level you will always have design, even if it is the design of how the system is to adapt to its environment.
Here's a simple thought experiment to show that this will not 'beat A/B testing every time.' Imagine you have two designs: one has a 100% conversion rate, one has a 0% conversion rate. Simple A/B testing will allow you to pick the winning design. Whereas this solution is still picking the 0% design 10% of the time.
For some other implementations check out the following links:
For Dynamic Resampling:
For Optimal Termination Time:
You seem to be saying "I'll A/B test it just for a little while, then weed out the 0% one, but in the case of this new algorithm, I'll let it run for a long time." That's not exactly fair. Not to mention, both algorithms would allow you to clearly see that the 0% option sucks.
That's actually very useful for me though. Especially if a site has a lot of tests, or I'm running tests for a multitude of clients. It means I have to babysit the tests less frequently.
At some point you still step in and decide based on the data, especially if you detect degenerate cases, and move on to the next experiment.
Sure the OP advocates "set and forget", but it seems simply silly to do this indefinitely. Obviously, after a while, you're going to want to check in on the results and prune out low performers.
Is that not the entire point of all these different testing methodologies to begin with?
Also, if you were to follow the same "set and forget" strategy with A/B, you'd have a 50% chance of showing the 0% CR design, whereas it's only 10% with this method.
Seems like it wins out in both cases to me.
Even your thought experiment is not conclusive. First, it will take some time to be convinced which variation is better. The lost revenue during the 50/50 experimentation phase may outweigh the long-run benefit if your discount rate is high. This can happen in markets where customer acquisition is slow and expensive (i.e. a niche targeted via AdWords). Second, except in the unrealistic 0% vs 100% case, you could get stuck on the wrong choice forever if you pass a statistical significance test by chance. Epsilon-greedy will correct itself over time and is asymptotically guaranteed to make the correct choice most of the time. A/B testing only has this asymptotic guarantee if you spend asymptotically long experimenting, which defeats the whole exercise.
For example, record the conversions from the exploration path vs each individual exploitation attempt. Then, adjust epsilon accordingly, so that it is within a certain percentage of one of your best choices.
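One way to read that suggestion (my interpretation, not necessarily the commenter's exact scheme): scale epsilon by how costly exploring currently is, i.e. how far the conversion rate of random exploration lags behind the current best arm. A hedged sketch, with made-up default values:

```python
def adjusted_epsilon(explore_conv_rate, best_conv_rate,
                     base=0.1, floor=0.01, ceiling=0.3):
    # The cost of exploring is how much worse a random choice converts
    # than the current best arm. When exploration converts nearly as
    # well as the best (the arms look similar), exploring is cheap, so
    # keep epsilon near its base; when exploration lags far behind,
    # dial it down toward the floor.
    if best_conv_rate <= 0:
        return ceiling  # no evidence of a winner yet, explore freely
    ratio = explore_conv_rate / best_conv_rate  # 1.0 means exploring is free
    eps = base * ratio
    return max(floor, min(ceiling, eps))
```

For example, `adjusted_epsilon(0.05, 0.10)` halves the base rate, while `adjusted_epsilon(0.001, 0.10)` bottoms out at the floor. The floor matters: epsilon must never hit zero, or the algorithm loses its ability to correct itself.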
See, we've spent the last year doing an Omniture implementation, using T&T, and frankly, it's been a waste of money and time.
This isn't a criticism of the T&T tools; it's a criticism of our own internal analytics handling and analysis team, and possibly the implementation guidance we received from a few places.
You can guess at generalizations from the A/B/N changes you've made and try them again, but practically speaking? Meh. It seems like the learnings from one page are very hard to transfer to another page.
That's why you don't see posts like "Top ten generalizations about how to make your website better!" Instead you pay "SEO Experts" and "Analytics Ninjas" as consultants, and they tell you things like "pages with images convert more, generally, but that may not be the case for your specific website because of blah". Handy. Do you have any more generic and obvious advice that doesn't tangibly translate into practical changes to my website?
Here's the thing: Running an A/B is easy, but analyzing the results is really hard. Generating some kind of general analytics rules is very hard.
The reason I like this approach is simple: it's easy. It's easy to implement. It's easy to explain. It's easy to convey to the designers that you're doing a 'throw it at the wall and see what sticks' approach to pages. It's easy to explain to managers why the best page was picked. You don't need a self-important analytics ninja who will go on endlessly about user segments and how, if you segment the data differently, you can see kittens in the page view statistics.
Just make lots of pages, and see what happens.
It's not perfect, but it's enough to get started (...and honestly, that's the most important thing when you're working with analytics; it beats the hell out of a 3-month implementation cycle that ends up with... page views. To be fair, my opinion about the uselessness of that implementation is not presently shared by the marketing department, who like the pretty graphs. Whatever they mean.)
Here's a C++ implementation: https://github.com/joelthelion/uct
Makes me wonder if the 10% number couldn't be changed to something that's a function of the number of rewards total; the longer it runs, the less variation there is and the more confident you are in the choice made.
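That idea has a name: epsilon-decreasing schedules, where exploration shrinks as evidence accumulates. A common choice is to decay epsilon like 1/t; this is a small sketch, with `c` and the cap picked arbitrarily for illustration:

```python
def epsilon_at(t, c=10.0, eps_max=0.1):
    # Decay exploration as c/t, capped at eps_max. The constant c
    # controls how long the early high-exploration phase lasts.
    # The trial counter t starts at 1.
    return min(eps_max, c / t)

# Exploration shrinks as trials accumulate:
for t in (1, 100, 1000, 10000):
    print(t, epsilon_at(t))
```

With these values, epsilon stays at the 10% cap for the first 100 trials, then falls to 1% by trial 1,000 and 0.1% by trial 10,000, so the longer the test runs, the less traffic is spent on variants you are already confident about.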
I noted elsewhere that A/B can be thought of as an epsilon-first learning approach: play randomly 100% of the time until p-value < alpha, then play greedy (play the 'winner'). As an aside, it is unclear to me how using p-values is a clearer, easier, or more efficient decision rule for these types of problems. The p-value is almost always misinterpreted as Prob(B>A|Data); the choice of alpha determines the threshold but is arbitrary, and often a straw-man default, implicitly biasing your confusion matrix. I'm not saying that you won't get good results, just that it is not clear that it is a dominant approach.
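The epsilon-first framing can be made concrete: split traffic 50/50, and switch to pure greedy once a two-proportion z-test crosses the chosen alpha. A sketch, using the 200/500 vs 150/500 numbers that appear later in this thread; alpha = 0.05 (z > 1.96) is exactly the arbitrary knob being criticized:

```python
import math

def z_score(conv_a, n_a, conv_b, n_b):
    # Pooled two-proportion z-test statistic for conversion rates.
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)  # pooled conversion rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Epsilon-first: play 50/50 until |z| > 1.96 (alpha = 0.05),
# then play the winner forever.
z = z_score(200, 500, 150, 500)
print(f"z = {z:.2f}, significant at alpha=0.05: {abs(z) > 1.96}")
```

Here z comes out around 3.3, so the test would stop and declare A the winner; with a different alpha (or with early peeking at the running z) the same data can yield a different decision, which is the arbitrariness the comment is pointing at.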
This simple post I wrote on agents and online learning might be informative http://mgershoff.wordpress.com/2011/10/30/intelligent-agents...
But don't take my word for it (disclaimer: I work for www.conductrics.com, which provides decision optimization as a service) take a look at a great intro text on the topic by Sutton & Barto http://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.ht...
There was one major flaw with this strategy though:
Let's say you're testing a landing page and have had 1,000 visitors; version A is converting at 40% while version B is converting at 30%. It looks like this:
Version A - 200 / 500 - 40%
Version B - 150 / 500 - 30%
A new affiliate comes on board and decides to send 200 visitors to your page from some "Buy 200 visitors for $1" domain redirection service. These visitors are such low quality that they will never ever buy anything and will probably just close the window immediately (or are bots). Now your results look something like this:
Version A - 200 / 680 - 29.4%
Version B - 150 / 520 - 28.8%
And with just 200 visitors some random affiliate has killed all your results. Now you could add filtering and reports based on the affiliate or traffic source but this is more code and more attention you have to pay to the test.
If you were running a traditional A/B test your results would look like this:
Version A - 200 / 600 - 33%
Version B - 150 / 600 - 25%
And even though the overall conversion rate is lower, you can still see that version A is better than B.
The idea is good and I love the idea of auto optimization, but it does have its flaws, which require more than 20 lines of code to overcome.
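The dilution in that example is easy to reproduce. The arithmetic below uses the numbers from the comment above, assuming a greedy allocator routes roughly 90% of the 200 junk visitors to the current leader (that 90/10 split is the illustrative assumption):

```python
# Before the junk traffic: conversions / impressions.
a_conv, a_n = 200, 500   # 40.0%
b_conv, b_n = 150, 500   # 30.0%

# 200 never-converting visitors arrive; a greedy allocator routes
# ~90% of them to the current leader A.
junk = 200
a_n += int(junk * 0.9)   # +180 -> 680 impressions
b_n += int(junk * 0.1)   # +20  -> 520 impressions

rate_a, rate_b = a_conv / a_n, b_conv / b_n
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  gap: {rate_a - rate_b:.1%}")
```

The observed gap collapses from 10 points to well under 1 point (29.4% vs 28.8%). Under a traditional 50/50 split the same 200 junk visitors land 100 per arm, giving 33.3% vs 25.0%, so the gap survives, as the comment says.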
A problem with this (well one of them) is that it could home in on the worst of spammy techniques, like masquerading as a legitimate message, shaking animation, illegal claims and "simple" lies/promises. A solution is to filter these out - one could hand-code specific cases, but a general solution seems as difficult as AI in the first place, and requires external knowledge (such as of infrastructure messages, human visual physiology, laws and their interpretation, the concept of lying). It's hard to gather data (e.g. each lawsuit); but maybe something along the lines of a simple "complaint" button would do (use statistics to discount accidental/abusive clicks).
I am a big fan of two techniques that I feel would enhance an approach like the one suggested: FIR filters and Decay.
Simply put: I believe it is important to have a mechanism through which decisions are not made too quickly. A finite impulse response filter would take care of this very well.
In addition to that, older measurements should not carry the same importance as fresh data. Who cares what people thought about the buttons (or whatever) a month or two ago? The crowd visiting the site this week could have been affected by solar flares. Maybe they prefer something else.
Obviously you need enough traffic to be able to use such techniques.
Not sure it's worth the complexity in all cases.
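A minimal sketch of the decay idea, using exponential discounting of old observations rather than an FIR filter (simpler to state, same spirit); the decay factor here is made up for illustration:

```python
DECAY = 0.999  # per-impression discount; older data gradually fades out

stats = {"A": [0.0, 0.0], "B": [0.0, 0.0]}  # [decayed conversions, decayed n]

def record(variant, converted):
    # Shrink every arm's history a little before adding the fresh
    # sample, so this week's visitors matter more than last month's.
    for s in stats.values():
        s[0] *= DECAY
        s[1] *= DECAY
    stats[variant][1] += 1
    if converted:
        stats[variant][0] += 1

# Demo: variant A converted well long ago, but poorly recently.
for _ in range(1000):
    record("A", True)
for _ in range(1000):
    record("A", False)

plain_rate = 0.5  # what undiscounted counting would report
decayed_rate = stats["A"][0] / stats["A"][1]
print(f"decayed estimate: {decayed_rate:.2f} vs plain {plain_rate}")
```

The decayed estimate lands well below the plain 50%, because the recent misses dominate; a non-decaying counter would never notice that the arm's performance had changed.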
This works well when the reward you're optimising is a simple binary event:
* the user clicked the button
* the user signed up
* the user put something in their cart
* the user paid X
However, in the real world, things are not always that simple. What if you want to optimise:
* the percentage of users that returns to the site
* the time that users spend on your site
* the number of pages that they view in a session
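Mechanically, at least, nothing in the bandit bookkeeping requires a binary reward: you can track a running mean of any numeric outcome instead of a conversion rate. A sketch with invented session times (the hard part the comment is raising, attributing delayed outcomes like return visits back to a variant, is not solved by this):

```python
import random

# Per-variant [total_reward, count] for a numeric reward,
# e.g. seconds spent on site, instead of a conversion yes/no.
stats = {"A": [0.0, 0], "B": [0.0, 0]}
EPSILON = 0.1

def choose():
    # Explore at random until every arm has at least one sample,
    # then epsilon-greedy on the mean reward.
    if random.random() < EPSILON or any(c == 0 for _, c in stats.values()):
        return random.choice(list(stats))
    return max(stats, key=lambda v: stats[v][0] / stats[v][1])

def record(variant, reward):
    stats[variant][0] += reward
    stats[variant][1] += 1

random.seed(1)
# Made-up ground truth: A averages 90s on site, B averages 60s.
for _ in range(5000):
    v = choose()
    record(v, random.gauss(90 if v == "A" else 60, 20))

best = max(stats, key=lambda v: stats[v][0] / stats[v][1])
print("winner:", best)
```

The allocator converges on the higher-mean variant just as it would on a higher conversion rate; the open question is how you assign a number like "returned to the site" to the session where the variant was shown.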
Option 1: let the next set of values (colors of the button, in the example) be generated automatically after n runs.
Option 2: let users modify the CSS as part of a "customization" feature. I remember the BBC did this before their new site officially launched a few years ago.
Just set epsilon = 0.1 (10%) higher or lower depending on your initial (personal) confidence. If your guess was right and your epsilon was low, then the overall impact on the 'optimal' solution is negligible, but you have built in a fail-safe in case you were human after all.
How is this different from Genetic Algorithms?
For instance, if you're testing A, B, and C; you can start off with success/total values of 1/1 for each or 100/100 (for extreme values). If you start off with 1/1, a single hit or a single miss will swing the algorithm quickly and heavily in that particular direction; e.g. 1 miss for C results in 1/2, and brings its success rate down from 100% to 50% immediately, giving instant precedence to A and B. Whereas if you used 100/100 to start, a single miss for C would only bring it down to 100/101, letting the algorithm take much longer to "settle," but with far more confidence.
The trick is in picking a number that suits your needs, e.g. for expensive traffic sources (AdWords) pick smaller numbers to minimize the cost of the experiment, and for cheaper, higher-volume sources use larger numbers because you can afford the extra time to be sure.
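Those starting counts are just additive smoothing (pseudo-counts acting as a prior). The effect of a single miss under each prior is easy to check:

```python
def smoothed_rate(successes, trials, prior_s, prior_n):
    # Additive smoothing: pretend we already saw prior_s successes
    # in prior_n trials before the experiment started.
    return (successes + prior_s) / (trials + prior_n)

# One real miss (0 successes in 1 trial) for arm C:
print(smoothed_rate(0, 1, 1, 1))      # weak 1/1 prior: rate drops to 50%
print(smoothed_rate(0, 1, 100, 100))  # strong 100/100 prior: barely moves
```

With the weak prior the estimate swings to 0.5 after one miss, while the strong prior holds it near 0.99 (100/101), which is exactly the jumpy-versus-sluggish trade-off described above.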
I'm really looking forward to hearing comments from people With Actual Maths!
It seems the article mentions something quite important in their algorithm: mostly do what has the most expected value. The people who are great just have a much better way to judge that. They can make a poster with 20 design choices (or 50, or 100) and estimate what would probably work and what probably wouldn't for each one, from the size of the poster, to the font sizes and types, whitespace, where to place different elements, graphics choices, etc. They certainly include a random aspect, but it is the exception rather than the norm. Mostly what dictates choices is your expected return on them, and the random aspects are compared against that.
They do pay exquisite attention to their random choices: "This week I decided to see what would happen if I mixed up the day, date, location, and description rather than have it be in logical order, to see if this engaged people any more" (or: to change any other choice randomly). But it is the exception rather than the norm, and done rarely rather than often. They still pay attention to the results, which helps inform their "expected value" function.
(I didn't spend much time on the article or the linked papers, so please feel free to correct me if I'm misinterpreting. A rigorous algorithm doesn't have much to do with real-world choices, and we are simply nowhere near having an automated web service to write your copy, regardless of how many users get to give feedback on it. So the whole thing isn't very interesting to me, and the above is just my impression of 'why'.)
1) this approach is better than A/B testing.
2) it doesn't suffer from the Heisenberg principle
Also, the call-out box at the top is obviously an ad but it's not marked "Sponsored link", which is really scammy.
That's because it's a link to a webapp the author built.