
Power, minimal detectable effect, and bucket size estimation in A/B tests - runesoerensen
https://blog.twitter.com/2016/power-minimal-detectable-effect-and-bucket-size-estimation-in-ab-tests
======
fotp
What is your approach for dealing with non-binary metrics that aren't normal?
I've run into many cases where something like "Average Tweets Liked" tends to
be heavily skewed right (extreme values and a LOT of zeros / ones). The nature
of these distributions throws off the bucket size calculation. Any thoughts?

~~~
cshimmin
They briefly touched on the ability to trigger experiments. You can pretty
easily do zero-suppression by simply looking at users who already regularly
"like" tweets, and more generally by crafting a clever enough trigger you can
move the distribution to be closer to normal. Basically you end up sampling a
more homogeneous (sub)population that your experiment can at least quantify
results about. If you want to focus on getting people who don't "like" tweets
to start doing so, you just invert the trigger and look at that population
independently. And then you have fun looking at (anti)correlations between the
response of the two populations.
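
To put rough numbers on the zero-suppression idea, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the 70% share of users who never like, the lognormal shape for the rest, the `is_liker` flag standing in for a pre-experiment trigger, and the `bucket_size` helper, which just applies the textbook normal-approximation formula n = 2 (z_{1-alpha/2} + z_{power})^2 sigma^2 / delta^2 per bucket. It is not the calculation from the linked post.

```python
import math
import random
from statistics import NormalDist, mean, pvariance


def bucket_size(values, rel_lift, alpha=0.05, power=0.8):
    """Users per bucket for a two-sided, two-sample z-test on the mean."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    delta = rel_lift * mean(values)  # absolute effect we want to detect
    return math.ceil(2 * (z_alpha + z_power) ** 2 * pvariance(values) / delta ** 2)


rng = random.Random(0)

# Assumed population: 70% of users never like; the rest are skewed right.
users = []
for _ in range(50_000):
    is_liker = rng.random() < 0.3  # hypothetical stand-in for a pre-experiment trigger
    likes = 1 + int(rng.lognormvariate(1.0, 1.0)) if is_liker else 0
    users.append((is_liker, likes))

everyone = [likes for _, likes in users]
triggered = [likes for is_liker, likes in users if is_liker]

n_all = bucket_size(everyone, rel_lift=0.05)
n_triggered = bucket_size(triggered, rel_lift=0.05)
print(f"untriggered: {n_all} users per bucket")
print(f"triggered:   {n_triggered} users per bucket")
```

The triggered population needs fewer users per bucket because dropping the zeros lowers the variance relative to the mean. Keep in mind the two numbers answer different questions: a 5% lift among existing likers is not the same effect as a 5% lift across everyone.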

In general it's really hard to work with non-Gaussian statistics when no
realistic model for the data is available, particularly when trying to apply
asymptotic techniques like these. In practice it's usually easier to just try
to craft a test statistic that behaves in a more friendly way, and that's
where experimentalists have to be clever :)
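
One common way to get a friendlier statistic, sketched here only as an example and not as a recommendation from the linked post, is to analyze a transformed metric such as log(1 + likes) instead of raw likes. The simulated distribution below (70% zeros, lognormal otherwise) and the `skewness` helper are assumptions for illustration. Note that comparing means of the transformed metric tests a different quantity than the raw average.

```python
import math
import random
from statistics import mean, pstdev


def skewness(xs):
    """Population skewness: third central moment over the cubed std dev."""
    m, s = mean(xs), pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)


rng = random.Random(1)

# Assumed metric: most users have zero likes, the rest are skewed right.
likes = [
    1 + int(rng.lognormvariate(1.0, 1.0)) if rng.random() < 0.3 else 0
    for _ in range(50_000)
]
log_likes = [math.log1p(x) for x in likes]  # log(1 + x) is defined at zero

skew_raw = skewness(likes)
skew_log = skewness(log_likes)
print(f"skewness of raw likes:     {skew_raw:.2f}")
print(f"skewness of log(1 + likes): {skew_log:.2f}")
```

The transform tames the long right tail but does nothing about the spike at zero, so it pairs naturally with a trigger that removes the non-likers first.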

