First of all, great work. It looks like you boosted your conversion rate from 0.19% to 0.43%, which is a 125% improvement, or, with confidence intervals, a 55%-179% improvement.
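For anyone who wants to reproduce that kind of lift calculation, here is a minimal sketch using the log-ratio normal approximation for the rate ratio. The sample sizes below are hypothetical (the post doesn't report them), so the interval won't match the 55%-179% figure exactly:

```python
import math

# Hypothetical sample sizes -- the post doesn't report them.
n_a, x_a = 10000, 19   # control: 0.19% conversion
n_b, x_b = 10000, 43   # variant: 0.43% conversion

p_a, p_b = x_a / n_a, x_b / n_b
lift = p_b / p_a - 1           # relative lift: 0.43/0.19 - 1, about 126%

# 95% CI for the rate ratio via the log-ratio normal approximation
log_rr = math.log(p_b / p_a)
se = math.sqrt((1 - p_a) / x_a + (1 - p_b) / x_b)
ci_low = math.exp(log_rr - 1.96 * se) - 1
ci_high = math.exp(log_rr + 1.96 * se) - 1
print(f"lift: {lift:.0%}, 95% CI: {ci_low:.0%} to {ci_high:.0%}")
```

With larger (real) sample sizes the interval tightens toward the published one; the point is that the uncertainty on a lift this size is wide unless the conversion counts are substantial.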
However, before everybody goes out and puts puppies on their homepages, they need to realize that there are a bunch of things being tested.
Image vs. no image: Is it possible that having any image at all improves the conversion? You should test with other pictures, perhaps some animals, people, or nature, and see if the puppy is what makes it work.
Call to action: The 'puppy' version also features a more succinct call to action in "Sign up now" rather than "Start your 30 days free trial." Perhaps this also contributes some of the difference.
Button size: The button size in the 'puppy' version is smaller. Perhaps this has some effect as well.
Length of text: The 'puppy' version has more description of what is involved in the free trial. It says "Pick a plan & sign up in 60 seconds. Upgrade, downgrade, cancel at any time." vs. the no-puppy version, which says "Start your 30 days free trial."
Vertical vs. Horizontal Layout: The 'puppy' version has a vertical layout of the text and button, where they are stacked on top of each other rather than placed side by side.
So there are at least five different changes made between these two designs. Clearly the second design wins on conversions, but it's not entirely clear to me why it wins.
If nothing else, they have a hypothesis to test in the next experiment.
Even when it's possible to isolate and remove ancillary changes to improve split-test purity, it's often not beneficial. If there's a significant number of changes, achieving statistical significance across the full matrix probably isn't even possible.
But that's OK, because limiting changes to a single test queue restricts your ability to move fast and try lots of things, which is where the benefit lies. So when you test, try cheap multivariate methods (there are a bunch!) to quickly understand how interactions between multiple changes affect results.
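As a concrete illustration of what a cheap multivariate analysis looks like, here is a toy 2x2 factorial readout for two of the factors discussed above (image and call-to-action). All of the cell counts are made up for the example:

```python
# Hypothetical 2x2 test: image (puppy / none) x call-to-action (short / long).
# Each cell is (conversions, visitors) -- invented numbers for illustration.
cells = {
    ("no_image", "long_cta"):  (19, 10000),
    ("no_image", "short_cta"): (27, 10000),
    ("puppy",    "long_cta"):  (31, 10000),
    ("puppy",    "short_cta"): (43, 10000),
}
rate = {k: c / n for k, (c, n) in cells.items()}

# Main effect of each factor: average rate difference, holding the other fixed.
image_effect = ((rate[("puppy", "long_cta")] - rate[("no_image", "long_cta")]) +
                (rate[("puppy", "short_cta")] - rate[("no_image", "short_cta")])) / 2
cta_effect = ((rate[("no_image", "short_cta")] - rate[("no_image", "long_cta")]) +
              (rate[("puppy", "short_cta")] - rate[("puppy", "long_cta")])) / 2
# Interaction: does the shorter CTA help more when the puppy is present?
interaction = ((rate[("puppy", "short_cta")] - rate[("puppy", "long_cta")]) -
               (rate[("no_image", "short_cta")] - rate[("no_image", "long_cta")]))
print(image_effect, cta_effect, interaction)
```

With real traffic you'd attach confidence intervals to each effect, but even this back-of-the-envelope version tells you which factor is carrying the lift and whether the changes interact.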
You can iterate on the other tests over time. Many A/B tests start with a larger change that may include multiple variables, but with that baseline increase in hand, they can now go ahead and test dog vs. cat vs. human as the image, or test a variety of different text sizes and lengths. This seems like a fantastic start, with plenty of room for further iteration and improvement.
That approach was discussed by Anscombe, and I wrote up a summary on the Custora Blog. However, just because an approach is frequentist or 'ad hoc' does not necessarily mean that there is anything wrong with it. The Bayesian approach requires making assumptions about the number of visitors to your site after you stop the test, which isn't really any less ad hoc than picking an error cutoff.
I like that article, but I have one major qualm about it. Everything that you do in a Bayesian model depends on the prior. Yet you often see, as there, someone tell you "Here is the rule to use" without telling you the prior.
But the prior actually matters. For instance, when you look at what Nate Silver did, most of the mathematical horsepower went into determining a really good prior based on historical data. Armed with that, he both can and does make inferences (which he's willing to publish).
That said, the Bayesian approach is conceptually so much better that Bayesian with a questionable prior can be better than a frequentist approach.
Finally, the fact that a Bayesian approach needs a somewhat arbitrary planning horizon does not particularly bother me. Financial theory tells us that businesses really should apply a discounting factor to future projected income, and when you apply an exponentially decaying discounting factor, the weighted number of future visitors generally comes out to a finite number. Yes, there are a lot of arbitrary factors in how you get to that number, but you can generally do it in a reasonable enough way that your A/B test is far less sloppy than every other part of the business. Heck, you can just say that your planning horizon is one year and use the expected number of visitors in that time as a cutoff.
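The "finite number of discounted future visitors" point is just a geometric series. A quick sketch, with entirely hypothetical traffic and discount-rate numbers:

```python
# An exponentially decaying discount makes the "future visitors" horizon
# finite. All numbers here are hypothetical.
monthly_visitors = 50_000
annual_discount_rate = 0.10                  # a typical corporate discount rate
d = (1 / (1 + annual_discount_rate)) ** (1 / 12)   # per-month discount factor

# Geometric series: sum over all future months of monthly_visitors * d**t
effective_visitors = monthly_visitors / (1 - d)
print(round(effective_visitors))

# Or just cap the horizon at one year, as suggested above:
one_year_visitors = monthly_visitors * 12
```

Either number works as the "remaining visitors" input to a Bayesian stopping rule; the discounted version is more principled, the one-year cap is easier to defend in a meeting.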
Anyways, I'd like to eventually get into this kind of issue in this series. But whether I can, I don't know. It certainly will be hard if I keep trying to pitch it at the level of mathematical background that I've been aiming for so far.
I have a problem when companies start claiming personalization at this extreme level. How can they claim to know that an individual "gets bored and checks email at 4pm"?
They are able to look at their customers and see when they open emails, and even report on the average time, but people are far noisier than they make it seem.
It's interesting that this attitude of pinning customers to a specific thing is so ingrained in their mentality that they bucket their customers: Johnson only ever drinks water, Aubrey rides his bike every day, rain or shine.
In reality people are complex and multifaceted, and it is important to acknowledge this when marketing to them.
That seems to be a pretty sound approach, compared to some of the stuff about multi-armed bandits that shows up here sometimes. And I certainly expect Noel Welsh to chime in as well.
There are two schools of thought about approaches to sequential testing: the Bayesian approach, led by Anscombe, and the frequentist, led by Armitage. I talked a bit about this and outlined Anscombe's approach here. And it is great to see such a nice write-up of the frequentist approach and the tables of stopping criteria.
If all goes perfectly, I will discuss more ways to think about the problem than just those two, and try to show some connections that may surprise people. But, judging by your nice article, I doubt that I'll prove to have anything to say that you don't already know. :-)
I am mostly critical of claims like '20 lines of code that will beat A/B Testing Every Time.' Multi-armed bandits are also not as useful for inference as the frequentist methods that Ben presents in his posts.
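The reason sequential testing needs Armitage-style stopping criteria in the first place is that "peeking" at an ordinary fixed-sample test inflates the false-positive rate. A quick simulation of that effect, with arbitrary parameters:

```python
import random, math

# Simulate running an ordinary 95% two-sample z-test after every batch of
# traffic, under a true null (both arms convert at 5%). Checking repeatedly
# pushes the false-positive rate well above the nominal 5%.
random.seed(0)

def peeking_trial(n_batches=20, batch=100, p=0.05):
    a = b = 0
    for i in range(1, n_batches + 1):
        a += sum(random.random() < p for _ in range(batch))
        b += sum(random.random() < p for _ in range(batch))
        n = i * batch
        pooled = (a + b) / (2 * n)
        se = math.sqrt(2 * pooled * (1 - pooled) / n)
        if se and abs(a / n - b / n) / se > 1.96:
            return True                 # declared "significant" under the null
    return False

false_positive_rate = sum(peeking_trial() for _ in range(500)) / 500
print(false_positive_rate)              # substantially above 0.05
```

Sequential designs (Bayesian or frequentist) exist precisely to let you look early without paying this penalty.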
I've spent a lot of time working with pipelining software, first at my last job doing bioinformatics research, and now for handling analytics workflows at Custora. We ultimately decided to write our own (which we are considering open-sourcing; email me if you are interested in learning more).
The initial system that I used was pretty similar to Paul Butler's technique, with a whole bunch of hacks to inform Make of the status of various MySQL tables and to allow jobs to be parallelized across the cluster.
At Custora, we needed a system specifically designed for running our various machine learning algorithms. We are always making improvements to our models, and we need to be able to do versioning to see how the improvements change our final predictions about customer behavior, and how those predictions stack up against reality. So in addition to versioning code and rerunning analyses when the code is out of date, we also need to keep track of different major versions of the code and figure out exactly what needs to be recomputed.
We did a survey of a number of different workflow management systems, such as JUG, Taverna, and Kepler. We ended up finding a reasonable model in an old configuration management program called VESTA. We took the concepts from VESTA and wrote a system in Ruby and R to handle all of our workflow needs. The general concepts are pretty similar to Drake, but it is specialized for our Ruby and R modeling.
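The core recompute-on-change idea behind Make/VESTA-style systems is easy to sketch: a step reruns only when the fingerprint of its code plus inputs differs from what's cached. This is a minimal illustration in Python, not Custora's actual API; every name here is hypothetical:

```python
import hashlib
import json
import pathlib

# Sketch of hash-based dependency tracking: a step's result is cached under a
# fingerprint of its code version and inputs, so changing either triggers a
# recompute while unchanged steps are reused.
CACHE = pathlib.Path("cache")

def fingerprint(code_version: str, inputs: dict) -> str:
    payload = json.dumps({"code": code_version, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_step(name, code_version, inputs, compute):
    CACHE.mkdir(exist_ok=True)
    key = CACHE / f"{name}-{fingerprint(code_version, inputs)}.json"
    if key.exists():                       # same code + inputs: reuse result
        return json.loads(key.read_text())
    result = compute(inputs)               # out of date: recompute and cache
    key.write_text(json.dumps(result))
    return result
```

Because old fingerprints stay in the cache, keeping several major versions of a model around and comparing their outputs falls out of the same mechanism.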
This is an interesting example of why randomization in experiments is important. If you allow users to self-select into the experiment and control groups, and then naively look at the results, the results might come out opposite to what is expected. This is known as Simpson's paradox. In this case, it was only the users for whom page load was already slowest who picked the faster version of the page, so naively looking at page load times made the pages look like they loaded slower.
However, once Chris controlled for geography, he was able to find that there was a significant improvement.
Moral of the story: run randomized A/B tests, or be very careful when you are analyzing the results.
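The reversal is easy to reproduce with toy numbers. In this sketch (all figures invented), Feather is faster within every region, yet because the slow-network region adopts it disproportionately, the naive pooled average says the opposite:

```python
# Toy illustration of the Simpson's-paradox effect described above.
# region: (baseline_views, baseline_avg_ms, feather_views, feather_avg_ms)
data = {
    "fast_region": (9000, 400, 1000, 300),
    "slow_region": (1000, 3000, 9000, 2500),
}

def pooled_avg(pairs):
    """Views-weighted average load time across regions."""
    return sum(v * ms for v, ms in pairs) / sum(v for v, _ in pairs)

baseline_avg = pooled_avg([(b_v, b_ms) for b_v, b_ms, _, _ in data.values()])
feather_avg = pooled_avg([(f_v, f_ms) for _, _, f_v, f_ms in data.values()])

# Feather is faster within each region...
assert all(f_ms < b_ms for _, b_ms, _, f_ms in data.values())
# ...but the naive pooled average makes Feather look much slower.
print(baseline_avg, feather_avg)
```

Randomized assignment would balance the regional mix between arms, so the pooled comparison and the within-region comparisons would agree.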
You have to be careful about how you sample for the A/B testing. Even if they properly chose a set of users to get Feather and a set of users to stay at baseline, their results would STILL get skewed, since the remote users who now got Feather would view disproportionately more videos than baseline, pushing up average load time in the experimental group anyway.
The question of Best Buy settling has come up. Most settlements involve a clause that prohibits either party from disclosing the terms of the settlement. Since the primary objective of First Round taking on the lawsuit was to 'teach big businesses a lesson,' a settlement would have been counterproductive for them, even if it would have resulted in a substantially higher payout or significantly decreased legal costs. First Round wanted blood to be shed publicly.