1. Every kind of lead gen that I have been involved with and thought to measure has large periodic fluctuations in user behavior. Measure it, people behave differently on Friday night and Monday morning.
2. If you're regularly running multiple tests at once, this should be a potential issue fairly frequently.
3. If you really fire and forget, then crud will accumulate. To get rid of that you have to do the same kind of manual evaluation that was supposed to be the downside of A/B testing.
4. Most people do not track multiple metrics on every A/B test. If so, you'll never see how it matters. I make that a standard practice, and regularly see it. (Most recently, last week. I am not at liberty to discuss details.)
5. I first noticed this with email tests. When you change the subject line, you give an artificial boost to existing users who are curious what this new email is. New users do not see the subject line as a change. This boost can easily last long enough for an A/B test to reach significance. I've seen enough bad changes look good because of this effect that I routinely look at cohort analysis.
That said, the people there are very smart and are doing something good. But I would be very cautious about time-dependent automatic optimization on a website that is undergoing rapid improvement at the same time.