With all metrics, it's important to understand what's actually going into the measure and where it might get tripped up.
A potential solution might be to add a decay factor, so that the older data carries less weight.
The calculation is straightforward once you set the state-transition and observation factors to the identity (in the scalar case, 1):
P1' = P0 + Q
K = P1' / (P1' + R)
x1 = x0 + K * (z - x0)
P1 = (1 - K) * P1'
x0, P0 - previous score, previous covariance; P1' - the covariance after inflating for age
Q - Roughly related to the age of the last measurement. Goes up with age.
R - Measurement error. Set it close to 0 if you are sure your measurements are always error-free.
z - the most recent measured value.
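The four-step update above can be sketched in a few lines (a minimal sketch; the function and variable names are my own, and the sample inputs are made up):

```python
def kalman_update(x0, p0, z, q, r):
    """One scalar Kalman step: blend the old estimate x0 with a new
    measurement z, weighted by how much we trust each."""
    p_pred = p0 + q              # predict: uncertainty grows with age (Q)
    k = p_pred / (p_pred + r)    # gain: near 1 trusts z fully, near 0 ignores it
    x1 = x0 + k * (z - x0)       # corrected estimate
    p1 = (1 - k) * p_pred        # corrected uncertainty
    return x1, p1

# e.g. old CTR estimate 0.05, new batch measured 0.08
x1, p1 = kalman_update(0.05, 0.01, 0.08, q=0.001, r=0.005)
```

Note that with R = 0 the gain is 1 and the filter just takes each new measurement at face value, which matches the "error-free measurements" remark above.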
Let's say you measure the number of clicks per 1000 impressions. You can now estimate the expected value (x1) for the next 1000; once that batch arrives, re-estimate again.
x1 = x0 + alpha * (z - x0)
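This is just an exponentially weighted moving average: each new batch pulls the estimate toward itself by a constant fraction alpha, so older batches decay geometrically. A minimal sketch (alpha = 0.1 is an arbitrary choice, not from the original):

```python
def ema_update(x0, z, alpha=0.1):
    """Exponential moving average: move the estimate a fixed
    fraction alpha of the way toward the new measurement z."""
    return x0 + alpha * (z - x0)

est = 0.0
for z in [50, 60, 55, 70]:   # clicks per 1000 impressions, per batch
    est = ema_update(est, z)
```

Larger alpha forgets history faster, which is exactly the decay-factor idea mentioned earlier.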
If "whatever design was most popular at that time" has a billion trials at a 90% success rate, and "a new superior design" has 100 trials with 97 successes, then the new design is favored by the algorithm. There is no need to "catch up" to the absolute number of successes.
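You can check that the comparison is on rates, not absolute counts, with a quick Thompson-sampling-style simulation (a sketch; the helper name is mine, and the counts are the ones from the example):

```python
import random

rng = random.Random(0)

def posterior_draw(successes, trials, rng):
    # Beta(successes + 1, failures + 1): posterior rate with a uniform prior
    return rng.betavariate(successes + 1, trials - successes + 1)

old_wins = sum(
    posterior_draw(900_000_000, 1_000_000_000, rng)   # 90% of a billion
    > posterior_draw(97, 100, rng)                    # 97 of 100
    for _ in range(1000)
)
# old_wins is tiny: the 97/100 arm beats the billion-trial arm in
# almost every draw, because its posterior sits well above 0.90
```

The billion trials only make the old arm's posterior narrower around 0.90; they never let it out-draw an arm whose evidence points to roughly 0.97.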
Say you have a sports site and test a new soccer-oriented layout against an old baseball-heavy one. During the North American day, the old baseball version wins easily. When North America goes to sleep, the site keeps serving baseball to the Europeans until it starts losing; several hours later soccer is the winner, but by then the Europeans are going to sleep, and so on. This is a contrived example that assumes equal traffic from both regions, and such a site should really be localized, but you get the point.
You can now evaluate the results conditioned on each group (American / non-American).
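One simple way to condition on the group is to keep a separate success/trial counter per (variant, group) pair and compare rates within each group (a sketch with made-up variant and group names):

```python
from collections import defaultdict

# stats[(variant, group)] = [successes, trials]
stats = defaultdict(lambda: [0, 0])

def record(variant, group, success):
    """Log one trial for this variant within this audience group."""
    s = stats[(variant, group)]
    s[0] += int(success)
    s[1] += 1

def rate(variant, group):
    """Observed success rate for this variant within this group."""
    successes, trials = stats[(variant, group)]
    return successes / trials if trials else 0.0

record("baseball", "NA", True)
record("soccer", "EU", True)
record("soccer", "EU", False)
# compare rate("baseball", "NA") vs rate("soccer", "NA"),
# and separately rate("baseball", "EU") vs rate("soccer", "EU")
```

Comparing within groups removes the time-of-day confound in the example: baseball never gets credit for winning merely because it happened to be served during the North American peak.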