> The turnaround time also imposes a welcome pressure on experimental design. People are more likely to think carefully about how their controls work and how they set up their measurements when there's no promise of immediate feedback.
This seems like a cranky rationalization of the lack of a fairly ordinary system.
Sure, you shouldn't draw conclusions about potentially small effects from < 24 hours of data. But if you've done any real-world A/B testing, let alone had any statistics training, you should already know that.
What this means is you can't tell whether an experiment launch has gone badly wrong. Small effect size experiments are one thing, but you can surely tell if you've badly broken something in short order.
Contrary to encouraging people to be careful, it can make people risk-averse for fear of breaking something. And it slows down the process of running experiments a lot. Every time you want to launch something, you probably have to launch it at a very small % of traffic, wait a full 24-36 hours to know whether you've broken anything, and only then increase the experiment size. Versus some semi-realtime system: launch, wait 30 minutes, did we break anything? No? OK, let's crank up the group sizes...

Without semi-realtime, you basically have to add two full days, times 1 + the probability of doing something wrong and requiring a relaunch (compounding, of course), to the development time of everything you want to try. Plus, if you have the confidence that you haven't broken anything, you can use much larger experiment sizes, so you get significant results much faster.
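To put rough numbers on that arithmetic (a back-of-the-envelope sketch with assumed values, not anything from the post): each relaunch repeats the soak period, so the expected added time is the soak length times the geometric series 1 + p + p² + ... = 1 / (1 - p).

```python
# Sketch of the expected calendar time added per experiment when every launch
# needs a safety soak before ramping. Values are illustrative assumptions.
def expected_added_days(soak_days: float, p_relaunch: float) -> float:
    """Each relaunch repeats the soak, so the expected number of soaks is the
    geometric series 1 + p + p^2 + ... = 1 / (1 - p_relaunch)."""
    return soak_days / (1.0 - p_relaunch)

# Two-day soak and a 20% chance each launch has to be redone:
print(expected_added_days(soak_days=2.0, p_relaunch=0.2))        # 2.5 days
# Versus a ~30-minute semi-realtime check with the same relaunch rate:
print(expected_added_days(soak_days=0.5 / 24, p_relaunch=0.2))   # ~0.03 days
```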
If the people running experiments do not know and cannot be told how to do the fundamental thing that they are trying to do, then you have bigger problems.
Which is ultimately what this post points to: the author doesn't trust his team and isn't listening to them and doesn't expect they will listen to him. Regardless of the degree to which the author is correct in his assumptions, the problem is more than just engineering.
You can totally break some product functionality somehow without necessarily triggering a software exception or server crash! You really do need to know the target events per experiment group.
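A minimal sketch of the kind of per-group check that implies, comparing target-event rates between groups; the function name, threshold, and data shape are illustrative assumptions, not anything from the post:

```python
# Sketch of a guardrail on target event counts per experiment group (names,
# threshold, and data shape are assumptions for illustration).
def flag_broken_groups(event_counts: dict[str, int],
                       users: dict[str, int],
                       max_relative_drop: float = 0.5) -> list[str]:
    """Flag treatment groups whose target-events-per-user rate has cratered
    relative to control, even though nothing threw an exception."""
    control_rate = event_counts["control"] / users["control"]
    flagged = []
    for group, count in event_counts.items():
        if group == "control":
            continue
        rate = count / users[group]
        if rate < control_rate * (1 - max_relative_drop):
            flagged.append(group)
    return flagged

# Example: treatment_b logs far fewer target events per user than control.
print(flag_broken_groups(
    event_counts={"control": 5000, "treatment_a": 4900, "treatment_b": 900},
    users={"control": 1000, "treatment_a": 1000, "treatment_b": 1000},
))  # ['treatment_b']
```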
You can break your product without noticeably affecting things like the HTTP 500 error rate, CPU utilization, etc. that you would likely see on an ops dashboard.
On our ops dashboard we see stuff like the number of ID syncs, the number of events processed (by type), etc. - I'd argue that if something is truly "broken" you see it.
If you're using funnel analytics to decide that the product is broken - I'd say you're probably doing something wrong.
> What this means is you can't tell whether an experiment launch has gone badly wrong.
Personally I prefer automated testing to tell me if a feature has gone badly wrong, not conversion numbers. Then I find out before I launch too.
Or do you mean that the UX is so badly designed that users cannot use your software anymore? In which case, maybe there are bigger problems than real-time analytics.