This type of error (which can happen very easily on a website going through continuous improvement) can take a very long time to recover from.
A/B tests do not have an issue with this because all versions will have similar mixes of data from before and after the improvement.
Better yet, make x depend in some way on the time since the last change, and on the relative change in performance of all options from before and after that change.
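One way to sketch that idea, reading "x" as the exploration rate of an epsilon-greedy bandit (the class and parameter names below are my own, illustrative assumptions, not anything from this discussion): boost exploration whenever you know a site change shipped, and let it decay back toward a floor as fresh data comes in.

```python
import random

class AdaptiveEpsilonGreedy:
    """Epsilon-greedy bandit whose exploration rate spikes after a known
    site change and then decays, so new data can overturn stale estimates.
    Illustrative sketch only; names are assumptions, not an established API."""

    def __init__(self, n_arms, base_epsilon=0.05, boost=0.5, decay=0.99):
        self.counts = [0] * n_arms      # pulls per arm
        self.values = [0.0] * n_arms    # running mean reward per arm
        self.base_epsilon = base_epsilon  # floor exploration rate
        self.epsilon = base_epsilon
        self.boost = boost    # exploration rate right after a change
        self.decay = decay    # per-pull decay back toward the floor

    def notify_site_change(self):
        # A deploy happened: distrust old estimates, explore more for a while.
        self.epsilon = self.boost

    def select_arm(self):
        # Decay exploration back toward the floor on every pull.
        self.epsilon = max(self.base_epsilon, self.epsilon * self.decay)
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        # Incremental update of the running mean for this arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

The point of the `notify_site_change` hook is that the period right after a change is exactly when your old estimates are least trustworthy, so you temporarily spend more traffic re-measuring all options.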
My problem with the bandit method is that I want to show the same test choice to the same person every time they see the page, so you can hide the fact that a test is running. If I do this with the bandit algorithm, it warps the results: different cohorts get different weightings of the choices, and different cohorts behave very differently for lots of reasons.
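A common way to get that per-person stickiness is to hash the user id to a point in [0, 1) and pick a variant from the cumulative weights. The sketch below (my own illustrative names, not from this discussion) shows exactly where the conflict with a bandit lives: with fixed A/B weights the mapping is stable, but once a bandit starts drifting the weights, users assigned at different times are drawn from differently weighted cohorts.

```python
import hashlib

def sticky_assignment(user_id: str, weights: list) -> int:
    """Deterministically map a user to a variant so they always see the
    same one for a given set of weights. With fixed weights (an A/B test)
    this is stable; with a bandit the weights drift over time, so users
    assigned at different moments come from differently weighted cohorts.
    Illustrative sketch only."""
    # Hash the user id to a reproducible point u in [0, 1).
    h = hashlib.sha256(user_id.encode()).hexdigest()
    u = int(h[:12], 16) / 16**12
    # Walk the cumulative weights to pick a variant.
    total = sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w / total
        if u < acc:
            return i
    return len(weights) - 1
```

Note that if you re-run this after the weights change, a user can land in a different variant, so in practice you'd have to store the first assignment per user, which is precisely what freezes each cohort at whatever weighting the bandit had when that cohort arrived.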
A forgetting factor will help.
This is a variant of the cohort issue that you're talking about.
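A minimal sketch of a forgetting factor, as an exponentially weighted estimate of an arm's conversion rate (class and parameter names are my own assumptions): a factor gamma < 1 downweights old observations, so data from before a site change, or from an early cohort, fades out instead of dominating forever.

```python
class ForgettingAverage:
    """Exponentially weighted estimate of an arm's reward rate.
    gamma < 1 discounts old observations each step, so the estimate
    tracks the recent behaviour of the arm. Illustrative sketch only."""

    def __init__(self, gamma=0.999):
        self.gamma = gamma
        self.weight = 0.0   # discounted count of observations
        self.total = 0.0    # discounted sum of rewards

    def update(self, reward):
        # Discount everything seen so far, then add the new observation.
        self.weight = self.gamma * self.weight + 1.0
        self.total = self.gamma * self.total + reward

    @property
    def value(self):
        return self.total / self.weight if self.weight else 0.0
```

With gamma = 0.9, for example, fifty conversions following fifty failures yield an estimate near 1.0 rather than the plain-mean 0.5, because the early failures have been almost entirely forgotten.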
The cohort issue that you're talking about raises another interesting problem. If you have a population of active users, and you want to test per user, you will often find that your test population ramps up very quickly until most active users are in, and then slows down. The window in which most users are assigned is exactly the period where your data is poorest (you have not been collecting for long, and users have not necessarily had time to reach a final sale).
It seems to me that if you want to use a bandit method in this scenario, you'd be strongly advised to make your fundamental unit the impression, not the user. But then you can't hide the fact that a test is going on. Whether or not that is acceptable is a business question, and the answer is not always going to be yes.