This is an interesting technique, but it too has flaws. If there is a period of buzz and excitement surrounding your app, whatever design was most popular at that time will be rewarded accordingly, and accrue a high click-through rate backed by tens of thousands of cases. If you introduce a new, superior design after the buzz has gone away, the new design may take a very long time to catch up. Even though it is currently outperforming the old one, the few hundred cases that are being added won't be enough to offset the tens of thousands that came before. With all metrics, it's important to understand what's actually going into the measure and where it might get tripped up. A potential solution might be to add a decay factor, so that older data carries less weight.
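One way to picture a decay factor is the rough sketch below. It assumes each design keeps two running totals that are multiplied by a constant slightly below 1 on every impression; the class name `DecayedArm` and the default `decay=0.999` are made up for illustration, not taken from the article.

```python
class DecayedArm:
    """Click and impression totals that fade by a constant factor per impression."""

    def __init__(self, decay=0.999):
        # Hypothetical default: an effective memory of roughly 1 / (1 - decay) impressions.
        self.decay = decay
        self.clicks = 0.0
        self.trials = 0.0

    def record(self, clicked):
        # Shrink the old totals first, then add the new observation at full weight.
        self.clicks = self.clicks * self.decay + (1.0 if clicked else 0.0)
        self.trials = self.trials * self.decay + 1.0

    def ctr(self):
        # Decayed click-through rate; 0.0 before any data has arrived.
        return self.clicks / self.trials if self.trials else 0.0
```

With plain counts, 20,000 clicked impressions followed by 5,000 unclicked ones still reports a CTR of 0.8; with the decayed totals the estimate has already dropped close to zero, because the old burst has mostly been forgotten.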

 Better than a forgetting factor, add a Kalman filter (http://en.wikipedia.org/wiki/Kalman_filter). This way you can trust your "new" data more than really "old" data, etc. The beauty of it is that it only adds three attributes to each data sample.
 Could you expound on this a bit? What attributes would you have to add? How would you calculate scores?
 You would add a variance (P), an estimate of the value (x), and the timestamp of the last measurement. Using the last timestamp you can calculate Q. Generally, the older the last measurement, the higher Q. The calculation is straightforward once you let the state-transition and observation models be the identity:

```
Pp = P0 + Q
K  = Pp / (Pp + R)
x1 = x0 + K * (z - x0)
P1 = (1 - K) * Pp
```

 Now you have the new score for your data (x1) and a new variance to store (P1). The other values are:

 x0, P0 - previous score, previous variance.
 Pp - predicted variance, i.e. the previous variance inflated by Q.
 K - the gain, i.e. how much weight the new measurement gets.
 Q - Roughly related to the age of the last measurement. Goes up with age.
 R - Measurement error. Set it close to 0 if you are sure your measurements are always error-free.
 z - the most recent measured value.

 Let's say you measure the number of clicks per 1000 impressions. Now you can estimate the expected value (x1) for the next 1000. After the second 1000, re-estimate again.
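 A minimal Python sketch of that update, for a single scalar score. The names `ScoreState` and `kalman_update` are hypothetical, and so is the choice to make Q grow linearly with the age of the last measurement (`q_per_second`); the comment above only says that Q goes up with age.

```python
from dataclasses import dataclass


@dataclass
class ScoreState:
    x: float  # current score estimate, e.g. CTR
    P: float  # variance of that estimate
    t: float  # timestamp of the last measurement, in seconds


def kalman_update(state, z, now, R=1e-4, q_per_second=1e-8):
    """Fold one new measurement z, taken at time `now`, into the stored state."""
    # Assumption: process noise grows linearly with the time since the last update.
    Q = q_per_second * max(0.0, now - state.t)
    P_pred = state.P + Q               # older estimate -> less certain
    K = P_pred / (P_pred + R)          # gain: how much to trust the new measurement
    x_new = state.x + K * (z - state.x)
    P_new = (1.0 - K) * P_pred
    return ScoreState(x=x_new, P=P_new, t=now)
```

 If the stored estimate is fresh, K stays small and a new measurement only nudges the score; if the estimate is stale, K approaches 1 and the score jumps most of the way to the new measurement.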
 Thanks for explaining that!
 How does a decay factor not "trust your 'new' data more than really 'old' data"?
 The Kalman filter is much more sophisticated. A typical re-estimation will be:

```
x1 = x0 + alpha * (z - x0)
```

 where alpha is static. The Kalman filter will make it dynamic, taking into account how you obtained the measurements, how old the last re-estimation was, how noisy the process is, etc. Want to do multivariate analysis? Make alpha a matrix transform.
 I don't think this flaw is real. If "whatever design was most popular at that time" has a billion trials and 90% successes, and "a new superior design" has 100 trials and 97% successes (97 out of 100), then the new design is favored by the algorithm. No need to "catch up" to the absolute number of successes.
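 The arithmetic can be checked in a couple of lines. This sketch assumes the algorithm exploits the option with the highest observed success rate; `best_by_rate` is a made-up helper name, and it expects every option to have at least one trial.

```python
def best_by_rate(stats):
    """Return the option with the highest successes / trials ratio.

    stats maps option name -> (successes, trials); trials must be > 0.
    """
    return max(stats, key=lambda name: stats[name][0] / stats[name][1])


stats = {
    "old": (900_000_000, 1_000_000_000),  # 90% over a billion trials
    "new": (97, 100),                     # 97% over a hundred trials
}
winner = best_by_rate(stats)
```

 Here `winner` is the new design, because 0.97 beats 0.90 regardless of how many trials stand behind each ratio.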
 Exactly what I was thinking. What are we missing?
 What was meant was: suppose there is a period of time when you get a lot of visits, lots of clicks, and an abnormally high CTR. This could happen due to external factors, for example if you make the front page of HN. Over time, this effect will vanish, but for a long time you will be stuck with high CTR estimates for the design that was in place when this happened.
 If anyone wants to search the literature, the terminology for this is the non-stationary bandit problem. The classical bandit problem is stationary but there has been plenty of research done into non-stationary variants as well.
 Nice post. One hypothetical case where it could end up serving the worse design more often is if you had times of the day where your users behave very differently due to time zones, e.g. Europe vs North America. Say you have a sports site and test a new soccer-oriented layout vs an old baseball-heavy one. In the day, the old baseball version wins easily. When NA goes to sleep, it would serve up baseball to the Europeans until it loses, then after several hours soccer is the winner. But then it is too late: the Europeans go to sleep, and on and on. This is an odd example and assumes equal balance, and the site should really be localized, but you get the point.
 This is actually pretty straightforward to overcome (and one of the real strengths of the Bayesian approach). Rather than using the direct counts of success for each group, add a prior belief that Americans will favor baseball and non-Americans will favor soccer (you can experiment to determine this number). You can now evaluate the results conditioned on each group (American / non-American).
 Or if you don't want to add in a specific prior belief about Americans vs. non-Americans, just keep one counter per option per continent-of-source-IP (or whatever) and the learning algorithm should work it out on its own. Of course, if you use too many bins then learning is going to take far too long.
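 A rough sketch of the one-counter-per-option-per-bin idea, assuming an epsilon-greedy chooser. `SegmentedBandit` and its parameters are hypothetical names, and how you derive the segment (continent of source IP, time zone, etc.) is left to the caller.

```python
import random
from collections import defaultdict


class SegmentedBandit:
    """Epsilon-greedy bandit that keeps separate counts for every segment."""

    def __init__(self, options, epsilon=0.1, rng=None):
        self.options = list(options)
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        # (segment, option) -> [successes, trials]
        self.counts = defaultdict(lambda: [0, 0])

    def _rate(self, segment, option):
        wins, trials = self.counts[(segment, option)]
        # Untried options get an optimistic 1.0 so they are sampled at least once.
        return wins / trials if trials else 1.0

    def choose(self, segment):
        # Explore with probability epsilon, otherwise exploit the best rate
        # observed within this segment only.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.options)
        return max(self.options, key=lambda o: self._rate(segment, o))

    def record(self, segment, option, success):
        entry = self.counts[(segment, option)]
        entry[0] += int(success)
        entry[1] += 1
```

 Each segment learns independently, so European traffic never has to "unlearn" what North American traffic taught the counters. The cost is the one mentioned above: every extra bin splits the data further and slows learning.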
 At this point you could flush out the inferior options and start anew with just the winning design and the new one.
 Instead of selecting the option with the largest expectation, you can select the option with the largest upper bound, i.e. the calculated expectation plus a margin that reflects how uncertain that expectation still is. The law of large numbers applies, so the bounds around the expectations get tighter and tighter as data comes in. That is a superior and more efficient method used in Monte Carlo simulations, e.g. for AI playing the game Go.
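 One well-known instance of this idea is the UCB1 rule; the sketch below assumes that is roughly what is meant here. The function name `ucb1_choose` and the `(successes, trials)` layout are made up for illustration.

```python
import math


def ucb1_choose(stats, c=math.sqrt(2)):
    """Pick the option with the highest upper confidence bound.

    stats maps option name -> (successes, trials).
    """
    total = sum(trials for _, trials in stats.values())

    def score(option):
        wins, trials = stats[option]
        if trials == 0:
            return math.inf  # always try an untested option first
        # Observed mean plus a bonus that shrinks as this option gathers trials.
        return wins / trials + c * math.sqrt(math.log(total) / trials)

    return max(stats, key=score)
```

 An option with few trials gets a wide bonus and is tried often; once its bound has tightened, it only keeps being served if its mean really is competitive.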
 He mentions this flaw and proposes the same solution.
