A/B testing landing pages is easy. But it's unclear to me how to A/B test features, or how to build a system where a large number of new features can be selectively turned on or off, or served in different versions to different users.
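To make the question concrete, here is a minimal sketch of the kind of system I mean, assuming each user has a stable ID and that flags live in a simple in-memory table. The names `FLAGS`, `variant_for`, and `new_dashboard` are made up for illustration and do not come from any particular library. Hashing the feature name together with the user ID keeps each user on the same variant across requests.

```python
import hashlib

# Hypothetical flag table: feature name -> list of (variant, weight).
# Weights are relative shares of users, not necessarily percentages.
FLAGS = {
    "new_dashboard": [("off", 50), ("v1", 25), ("v2", 25)],
}


def variant_for(user_id: str, feature: str, flags=FLAGS) -> str:
    """Return the variant of `feature` that `user_id` should see."""
    variants = flags.get(feature)
    if not variants:
        return "off"  # unknown features are treated as disabled
    total = sum(weight for _, weight in variants)
    if total <= 0:
        return "off"
    # Hash feature + user so assignment is stable per user and
    # independent between features.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    # Walk the weighted ranges until the bucket falls inside one.
    for name, weight in variants:
        if bucket < weight:
            return name
        bucket -= weight
    return "off"
```

Calling `variant_for("user-42", "new_dashboard")` returns the same variant every time for that user, so application code can branch on the result without storing the assignment anywhere.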
This gets more complicated if a newer version of a feature has a different database structure, to the point that the two versions conflict. For example, the previous version of a feature used xyz datapoints and wrote the result to the db, while the newer version uses a different formula. It sounds like you could just create a new table. But now suppose you have a method that shows an average across all users. Its stats are now incorrect, because some users are on v1, which uses formula1, and others are on v2, which uses formula2. Their stats measure different things, so averaging them will show incorrect trends.
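One direction that might address the averaging problem, sketched here under the assumption that every stored row records which feature version produced it. The `feature_version` and `value` field names and the sample numbers are hypothetical. The idea is to never average across versions, and to report one average per version instead.

```python
from collections import defaultdict
from statistics import mean


def average_by_version(rows):
    """Group values by the version that produced them, then average each group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["feature_version"]].append(row["value"])
    return {version: mean(values) for version, values in grouped.items()}


# Made-up sample data: v1 and v2 use different formulas, so their
# values are on different scales and should not be mixed.
rows = [
    {"user_id": 1, "feature_version": "v1", "value": 10.0},
    {"user_id": 2, "feature_version": "v1", "value": 20.0},
    {"user_id": 3, "feature_version": "v2", "value": 1.0},
    {"user_id": 4, "feature_version": "v2", "value": 3.0},
]

print(average_by_version(rows))  # {'v1': 15.0, 'v2': 2.0}
```

A single average over all four rows would give 8.5, a number that matches neither formula, which is exactly the incorrect trend described above.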