I built an internal a/b testing platform with a team of 3-5 over the years. It needed to handle extreme load (hundreds of millions of participants in some cases). Our team also had a sister team responsible for teaching/educating teams about how to do proper a/b testing -- they also reviewed implementations/results on-demand.
Most of the a/b tests they reviewed (note the survivorship bias here, they were reviewed because they were surprising results) were incorrectly implemented and had to be redone. Most companies I worked at before or since did NOT have a team like this, and blindly trusted the results without hunting for biases, incorrect implementations, bugs, or other issues.
> It needed to handle extreme load (hundreds of millions of participants in some cases).
I can see extreme load being necessary for an A/B test of a pipeline change or something that inherently needs that load... but for the kinds of A/B testing UX and marketing do, leaning on statistical significance with a smaller sample seems like the smarter move. Past a certain point, a larger sample is only trivially more accurate than a smaller one.
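To put rough numbers on that diminishing return, here's a small sketch (the 5% baseline conversion rate is made up for illustration): the 95% margin of error shrinks with the square root of the sample size, so each 10x increase in traffic only buys about a 3x improvement in precision.

```python
# Sketch of diminishing returns from sample size (illustrative numbers only).
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an observed conversion rate p."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"n={n:>9,}  +/- {margin_of_error(0.05, n):.4%}")
```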
Even if you're testing 1% of 5 million visitors, you still need to handle the load for 5 million visitors. Most of the heavy experiments came from AI-driven assignments (vs. behavioral). In this case the AI would generate very fine-grained buckets and assign users into them as needed.
Do you know if there were common mistakes for the incorrect implementations? Were they simple mistakes or more because someone misunderstood a nuance of stats?
I don't remember many specifics, but IIRC, most of the implementation-related ones were due to an anti-pattern from the older a/b testing framework. Basically, the client would try to determine if the user was eligible to be in the A/B test (instead of relying on the framework), then, in an API handler, get the user's assignment. This meant the UI could think the user wasn't in the A/B test at all, while the API saw the user as in it. In that case, the user would be experiencing the 'control' while the framework thought they were experiencing something else.
That was a big one for a while, and it would skew results.
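Roughly, the anti-pattern looks like the sketch below. All names are hypothetical (this isn't the real framework's API); the point is that the UI and the API each answer the "is this user in the test?" question differently.

```python
# Hypothetical sketch of the anti-pattern; names are illustrative only.
import hashlib

def framework_assignment(user_id: str, experiment: str) -> str:
    """Stand-in for the A/B framework: stable 50/50 hash bucketing.
    Calling this is what enrolls the user in the experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

def client_thinks_eligible(user_id: str) -> bool:
    """The UI's home-grown eligibility check (the anti-pattern): it can
    easily disagree with the framework (stale config, different logic)."""
    return user_id.startswith("beta_")  # arbitrary local rule

def render_ui(user_id: str) -> str:
    # UI gates the feature on its own check and never consults the framework.
    return "new_ui" if client_thinks_eligible(user_id) else "old_ui"

def handle_api(user_id: str) -> str:
    # API asks the framework, which records the user as enrolled in the test.
    assignment = framework_assignment(user_id, "checkout_v2")
    return "new_flow" if assignment == "treatment" else "old_flow"

# The UI shows this user the control experience, while the framework may have
# already enrolled them as "treatment" -- so the analysis counts them as treated.
uid = "user_123"
print(render_ui(uid), handle_api(uid))
```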
Hmmm, another common one was doing geographic experiments when part of the experiment couldn't be geofenced for technological reasons. Or forgetting that a user could leave a geofence, and then removing access to the feature after they'd already been given access to it.
Almost all cases boiled down to showing the user one thing while thinking we were showing them something else.
I wonder if that falls under mistake #4 from the article, or if there's another category of mistake: "Actually test what you think you're testing." Seems simple but with a big project I could see that being the hardest part.
I actually just read it (as best I could; the page is really janky on my device). I didn’t see this mistake in there, and it was the most common one we saw by a wide margin in the beginning.
Number 2 (1 in the article) was solved by the platform. We had two activation points for UI experiments. The first was getting the user's assignment (which could be cached for offline usage). At that point they became part of the test, but there was a secondary one that fired when the component under test became visible (whether it was a page view or a button). If you turned on this feature for the test, you could analyze it using either the first or the secondary point.
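As a rough illustration of those two activation points (the names below are hypothetical, not the platform's actual API): the assignment call is the first point, and the exposure call is the secondary one that fires when the component actually renders.

```python
# Hypothetical sketch of the two activation points; not the real platform API.
class Experiment:
    def __init__(self, name: str):
        self.name = name
        self.assignments = {}   # user_id -> variant (first activation point)
        self.exposures = set()  # users whose component actually became visible

    def get_assignment(self, user_id: str) -> str:
        """First activation point; can be fetched early and cached offline."""
        # Toy 50/50 bucketing; a real system would hash deterministically.
        return self.assignments.setdefault(
            user_id, "treatment" if hash(user_id) % 2 == 0 else "control")

    def log_exposure(self, user_id: str) -> None:
        """Secondary activation point: fire when the component under test
        (page, button, ...) becomes visible -- for BOTH variants."""
        self.exposures.add(user_id)

exp = Experiment("new_button")
variant = exp.get_assignment("user_42")  # user is now in the test
# ...later, when the button (or its control counterpart) actually renders:
exp.log_exposure("user_42")              # analysis can use either point
```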
One issue we saw with that (which is potentially specific to this implementation) was people forgetting to fire the secondary event for the control. That was pretty common, but you usually figured it out within a few hours when you got an alert that your distribution looked biased (if you specify a 10:20 split, you should get a 10:20 ratio of activity).
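That kind of alert is essentially a sample ratio mismatch check. A minimal sketch of one (the counts and threshold here are made up; a real system would run this continuously):

```python
# Minimal sample-ratio-mismatch check; counts and threshold are made up.
from scipy.stats import chisquare

configured_split = (10, 20)      # control : treatment weights
observed = (3_100, 19_800)       # activity counts per arm -- control is far
                                 # below its share, e.g. the secondary event
                                 # isn't firing for the control

total = sum(observed)
expected = [total * w / sum(configured_split) for w in configured_split]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:              # tiny p-value => the split looks biased
    print(f"SRM alert: expected ~{[round(e) for e in expected]}, "
          f"got {list(observed)} (p={p_value:.2e})")
```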
Same experience here for the most part. We're working on migrating away from an internal tool which has a lot of problems: flags can change in the middle of user sessions, limited targeting criteria, changes to flags require changes to code, no distinction between feature flags and experiments, experiments often target populations that vary greatly, experiments are "running" for months and in some cases years...
Our approach to fixing these problems starts with having a golden path for running an experiment which essentially fits the OP. It's still going to take some work to educate everyone but the whole "golden path" culture makes it easier.
When we started working on the internal platform, these were exactly the problems we had. When we were finally deleting the old code, we found a couple of experiments that had been running for nearly half a decade.
For giggles, we ran an analysis on those experiments: no difference between a & b.
That's usually the best result you can get, honestly. It means you get to decide whether to go with a or b. You can pick the one you like better.
It’s an experiment, you shouldn’t be “expecting” anything. You hypothesize an effect, but that doesn’t mean it will be there and if you prove it wrong, you continue to iterate.
This is the biggest lie in experimentation. Of course you expect something. Why are you running this test over all other tests?
What I'm challenging is that if a team has spent three months building a feature, and you a/b test it and find no effect, that is not a good outcome. Having a tie where you get to choose anything is worse than having a winner that forces your hand. At least then you have the option to improve your product.
> What I'm challenging is that if a team has spent three months building a feature, you a/b test it and find no effect, that is not a good outcome.
That's a great outcome. At one company we spent a few months building a feature only for it to fail the test; now that was a bad outcome. The feature's code was so good that we ended up refactoring it to look like the old feature and switching to that. So there was a silver lining, I guess.
The key takeaway was to never a/b test a feature that big again. Instead, we would spend a few weeks building something that didn't need to scale or be feature complete (IOW, an MVP/POC with shitty code).
If it had come out that there was no difference, we would have gone with the new version's code because it was so well built -- alternatively, if the code was shit, we probably would have thrown it out. That's why it's the best result. You can write shitty POC code and toss it out -- or keep it if you really want.
> Of course you expect something. Why are you running this test over all other tests?
Because it has the best chance to prove/disprove your hypothesis. That's it. Even if it doesn't, all that means is that the metrics you're measuring are not connected to what you're doing. There is more to learn and explore.
So, you can hope that it will prove or disprove your hypothesis, but there is no rational reason to expect it to go either way.
But why this hypothesis? Sometimes people do tests just to learn as much as they can, but 95%+ of the time they’re trying to improve their product.
> there is no rational reason to expect it to go either way.
Flipping a coin has the same probability of heads during a new moon as during a full moon. I’m going to jump ahead and expect that you agree with that statement.
If I phrase that as a hypothesis and do an experiment, suddenly there’s no rational reason to expect it to go either way? Of course there is. The universe didn’t come into being when I started my experiment.
Null hypothesis testing is a mental hack. A very effective one, but a hack. There is no null. Even assuming zero knowledge isn’t the most rational thing. But, the hack is that history has shown that when people try to act like they know nothing, they end up with better results. People are so overconfident that pretending they knew nothing improved things! This doesn’t mean it’s the truth, or even the best option.
I’d suggest reading up on the experimental method. There’s also a really good book: Trustworthy Online Controlled Experiments.
You are trying to apply science to commercial applications. It works, but you cannot twist it to your will or it stops working and serves no purpose other than a voodoo dance.
> Flipping a coin has the same probability of heads during a new moon as during a full moon. I’m going to jump ahead and expect that you agree with that statement.
As absurd as it sounds, it’s a valid experiment, and I actually couldn’t guess whether the extra light from a full moon would have a measurable effect on a coin flip. Theoretically it would, as light does impart a force… but whether or not we could realistically measure it would be interesting.
Yes, I’m playing devil’s advocate, but “if the button is blue, more people will convert” is just as absurd a hypothesis, yet it produced results.
Late response: I’ve read that book. I also work as a software engineer on the Experimentation Platform - Analysis team at Netflix. I’m not saying that makes me right, but I think it supports that my opinion isn’t from a lack of exposure.
> You are trying to apply science to commercial applications. It works, but you cannot twist it to your will or it stops working and serves no purpose other than a voodoo dance.
With this paragraph, you’ve actually built most of the bridge between my viewpoint and yours. I think the common scientific method works in software sometimes. When it does, there are simple changes to make it so that it will give better results. But most of the time, people are in the will-twisting voodoo dance.
People also bend their problems so hard to fit science that it’s just shocking to me. In no other context do I experience a rational, analytical adult arguing that they’re unsure if a full moon will measurably affect a coin flip. If someone in a crystal shop said such a thing, they’d call it woo.