The independent improvement also helps the random bandit choices. The problem is that you are comparing A from a largely new population with Bs that are mostly from an old population. It takes a long time to accumulate enough new Bs to resolve this issue.

A forgetting factor, which down-weights older observations so that recent ones dominate the estimates, will help.
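To make that concrete, here is a rough sketch of one way to do it, assuming Bernoulli (convert / no convert) rewards and Thompson sampling over Beta posteriors. The names `DiscountedArm`, `choose`, `update`, and the `gamma` parameter are all mine, purely for illustration, and this is just one common form of discounting, not the only one.

```python
import random


class DiscountedArm:
    """Beta-Bernoulli arm whose counts decay geometrically (hypothetical helper)."""

    def __init__(self, gamma=0.99):
        # gamma is the forgetting factor: 1.0 means never forget,
        # smaller values forget old observations faster.
        self.gamma = gamma
        self.successes = 0.0
        self.failures = 0.0

    def decay(self):
        # Shrink all past evidence by gamma before new data arrives.
        self.successes *= self.gamma
        self.failures *= self.gamma

    def record(self, converted):
        if converted:
            self.successes += 1.0
        else:
            self.failures += 1.0

    def rate(self):
        # Discounted conversion-rate estimate.
        n = self.successes + self.failures
        return self.successes / n if n > 0 else 0.0

    def sample(self, rng):
        # Thompson draw from Beta(successes + 1, failures + 1), i.e. a uniform prior.
        return rng.betavariate(self.successes + 1.0, self.failures + 1.0)


def choose(arms, rng):
    """Pick the index of the arm with the highest posterior draw."""
    draws = [arm.sample(rng) for arm in arms]
    return draws.index(max(draws))


def update(arms, chosen, converted):
    """Decay every arm each step, then credit the outcome to the chosen arm."""
    for arm in arms:
        arm.decay()
    arms[chosen].record(converted)


if __name__ == "__main__":
    rng = random.Random(0)
    arms = [DiscountedArm(gamma=0.99), DiscountedArm(gamma=0.99)]
    i = choose(arms, rng)
    update(arms, i, converted=True)
```

Decaying every arm on every step means an arm that has not been shown for a while drifts back toward the prior, so the bandit will try it again instead of trusting stale numbers from the old population. How fast to forget (the value of `gamma`) is a tuning decision that depends on how quickly your population turns over.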

This old-population versus new-population mismatch is a variant of the cohort issue that you're talking about.

The cohort issue that you're talking about raises another interesting problem. If you have a population of active users and you want to test per user, you will often find that your test population ramps up very quickly until most active users are in, and then slows down. The window in which most users are assigned is exactly the period where your data is poor: you have not been collecting for long, and users have not necessarily had time to reach a final sale.

It seems to me that if you want to use a bandit method in this scenario, you'd be strongly advised to make your fundamental unit the impression, and not the user. But then you can't hide the fact that the test is going on. Whether or not this is acceptable is a business problem, and the answer is not always going to be yes.