1) Ideally you would be able to measure the change in every metric, not just the ones you whitelist for a specific experiment. What if adding one feature changes how people interact with a completely different feature? You would want to know about this.
2) Showing a change without any sort of hypothesis testing is begging for people to draw unfounded conclusions from the results. Instead of a vague note that more than 100 sessions are necessary to get significance, you need to have real confidence intervals at the very least.
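To make "real confidence intervals" concrete, here is a minimal sketch of a Wilson score interval for a single variation's conversion rate. The function name and the 3-in-100 example are my own illustration, not anything the tool exposes.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(conversions, sessions, confidence=0.95):
    """Wilson score interval for a conversion rate (hypothetical helper)."""
    if sessions <= 0:
        raise ValueError("sessions must be positive")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = conversions / sessions
    denom = 1 + z * z / sessions
    center = (p + z * z / (2 * sessions)) / denom
    half_width = z * sqrt(p * (1 - p) / sessions + z * z / (4 * sessions ** 2)) / denom
    return center - half_width, center + half_width


# Example: 3 conversions in 100 sessions.
low, high = wilson_interval(3, 100)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

With only 100 sessions the interval runs from roughly 1% to roughly 8%, which is why a bare "100 sessions" threshold says very little.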
The author could have implemented a simple Chi-square test and reported CIs. The problem is that conversion rates are usually < 6%, and that means you'd need a MASSIVE sample size to detect a difference.
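A rough sketch of how massive, using the standard normal-approximation sample size formula for comparing two proportions. The 3% baseline, 10% relative lift, 5% significance level and 80% power are assumptions picked for illustration, and the function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sessions_per_variation(base_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate sessions needed per variation for a two-sided test."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # guards against Type II error
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Assumed scenario: 3% baseline conversion, hoping to detect a 10% relative lift.
print(sessions_per_variation(0.03, 0.10))
```

Under those assumptions the answer is on the order of 50,000 sessions per variation, and it grows quickly as the lift you want to detect gets smaller.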
Basically, Type II error is much more important for us than in typical statistical applications. Our statistical power is super important.
The author could implement Bayesian statistics with a Beta distribution prior initialized with alpha = 3, beta = 100 (mimicking a roughly 3% conversion rate). The results would be robust to this prior information. The problem is that there is no closed-form likelihood solution. This means you need to use Markov Chain Monte Carlo simulation. Web servers don't like that.
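A minimal sketch of that setup, with one simplification worth flagging: a Beta prior is conjugate to binomial conversion data, so each variation's posterior is itself a Beta distribution, and this sketch draws from those posteriors directly rather than running MCMC. The function name, the number of draws and the example counts are all assumptions for illustration.

```python
import random


def prob_b_beats_a(conv_a, sessions_a, conv_b, sessions_b,
                   prior_alpha=3.0, prior_beta=100.0, draws=20000, seed=0):
    """Estimate P(rate_B > rate_A) by sampling both Beta posteriors."""
    rng = random.Random(seed)
    # Conjugate update: add conversions to alpha, non-conversions to beta.
    a_alpha, a_beta = prior_alpha + conv_a, prior_beta + sessions_a - conv_a
    b_alpha, b_beta = prior_alpha + conv_b, prior_beta + sessions_b - conv_b
    wins = sum(
        rng.betavariate(b_alpha, b_beta) > rng.betavariate(a_alpha, a_beta)
        for _ in range(draws)
    )
    return wins / draws


# Made-up counts: A converts 30 of 1000 sessions, B converts 36 of 1000.
print(prob_b_beats_a(30, 1000, 36, 1000))
```

Whether this is cheap enough to run inside a request depends on the number of draws; it could just as well be computed offline and cached.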
In my experience, if you see a nice 10% boost in conversion rate (conv. b / conv. a) after some representative period of time like a few days, you should just go with that result.
In that way, you don't ignore what's smacking you in the head. "The implementation had a higher conv. rate or not over a few days." Detecting small differences really well with stats is fairly pointless in this space.
1) A lot of thought has gone into how we should solve this problem. We've played with several different concepts but have ultimately settled on one: starting this weekend, we're adding standard analytics "events" to Bestly, where you can send any arbitrary event that you want to track. This will then allow you to visualize how "x experiment, variation A" affects some arbitrary event elsewhere.
2) Yes! We agree. We should be displaying both our confidence level and statistical significance rather than the number 100.
Also, I love feedback, so feel free to email me anytime. james [at] best.ly