
When Randomization Is Not Enough: Improving Sample Balance in Online A/B Tests - kilimchoi
https://www.thumbtack.com/engineering/repeated-rerandomization/
======
bradleybuda
I am not a statistician, but this fails two sniff tests for me. Can someone
explain where my intuition is wrong?

1) Yes, you might find that running an A/A test on your data shows a "result"
for one of your groups, even though there obviously is none; this (as I see
it) is just due to insufficient data. Isn't running an A/A test isomorphic (in
a strong sense) to correctly checking the statistical significance of your A/B
test, and only believing your A/B test if it does in fact reach a high
significance threshold?

2) Re-randomizing your data the way they describe will inherently increase the
"orderedness" (decrease the entropy) of your data set, which risks biasing the
A and B groups in some way you don't fully understand. It seems like the re-
randomization procedure would at best leave the statistical properties of your
A and B groups unaltered, and at worst introduce a new bias that makes your
result fishy.

I'm sure there's something deeper going on here, but the post doesn't explain
it well.
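(For concreteness, the rerandomization being discussed can be sketched roughly like this. This is just my reading of the idea, not the article's actual code: reshuffle until some covariate, like age, is balanced between the groups. The balance threshold and the choice of standardized mean difference as the criterion are assumptions here.)

```python
import random
import statistics

def rerandomize(users, covariate, max_tries=100, threshold=0.1):
    """Reshuffle users into A/B groups until the groups' mean covariate
    values differ by less than `threshold` standard deviations, or until
    we run out of tries. A minimal sketch; the article's actual balance
    criterion may differ."""
    values = [covariate(u) for u in users]
    sd = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    for _ in range(max_tries):
        shuffled = users[:]
        random.shuffle(shuffled)
        half = len(shuffled) // 2
        a, b = shuffled[:half], shuffled[half:]
        gap = abs(statistics.mean(covariate(u) for u in a)
                  - statistics.mean(covariate(u) for u in b))
        if gap / sd < threshold:
            return a, b
    return a, b  # give up and accept the last split
```

The objection in 2) above is about exactly this loop: the accepted splits are a non-random subset of all possible splits, so the groups are no longer a plain random sample.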

~~~
jib
They are saying that the variation in tests can be lowered by including
covariate variables in the selection criteria. Let's say one of the many
products I sell is a "driver's license test kit" and I somehow know the age of
my site visitors. I have an idea for changing how I advertise the kit on my
website. Most of the people who would buy the kit are in a very specific age
bracket (16-20 year olds, depending on the legal age in your country). My site
has an even spread of visitor ages. If I randomly split my test pool and it
ends up with an imbalance in ages, my results will be skewed. So I can lower
the number of tests needed by never forming test groups or control groups
where all the 16-20 year olds end up on one side. I "know" those tests would
be irrelevant, so I just don't run them. If they would have yielded relevant
results I'm screwed, but if not, I saved time.
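The same balance can be had more directly with a stratified split: divide each age bracket evenly between the groups instead of hoping a plain shuffle does it. A rough sketch (the bracket labels and function names are illustrative, not from the article):

```python
import random
from collections import defaultdict

def stratified_split(visitors, bracket):
    """Split visitors into A/B groups so each stratum (e.g. age
    bracket) is divided evenly between them. `bracket` maps a
    visitor to a stratum label such as '16-20' or 'other'."""
    strata = defaultdict(list)
    for v in visitors:
        strata[bracket(v)].append(v)
    a, b = [], []
    for members in strata.values():
        random.shuffle(members)          # randomize within each stratum
        half = len(members) // 2
        a.extend(members[:half])
        b.extend(members[half:])
    return a, b
```

By construction, neither group can end up with all the 16-20 year olds; the randomness is confined to within each bracket.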

For your 1 and 2: yes and yes. They measure different things. But I can see
that "a test population of site visitors with an even age distribution" would
yield answers faster than a random distro. You are asserting a belief rather
than testing for it, adding a given and thereby reducing the problem space, so
you trade lower variance in results for higher uncertainty about significance
(because maybe your assertion that age drives interest is wrong, and then all
your tests are wrong). P(this test is totally meaningless) goes up, but P(if
this test is meaningful, the results are close to true) also goes up? That's
how I read it, at least.

