

A/B testing in Mailchimp: 7 years of successful experiments - hansy
http://blog.mailchimp.com/ab-testing-in-mailchimp-7-years-of-successful-experiments/

======
BrianEatWorld
My MailChimp account actually got banned due to limitations in their A/B
testing framework.

At the time, at least, they had no proper way to segment cohorts, meaning that
in order to control for things like length or circumstances of subscription,
we had to do multiple smaller sends to lists that were subsets of our
subscribers and aggregate the testing data afterwards, rather than being able
to do just one send. On top of creating a lot of extra work setting up the
sends, this skews the statistics on things like spam complaints.

What caused the account to be banned was that some of the smaller lists would
occasionally see spikes in spam complaints. These spikes were well within
statistical norms. However, because MailChimp will ban an account that breaks
a complaint threshold for a single list in a single send, we triggered the
ban, despite the list and send as a whole being well within their limits.
Had the A/B testing framework been built properly, to account for cohorts, we
would have had no issues.

I tried to speak with their customer service to explain the issue, but it
seems they either could not understand or weren't interested. All of this
after multiple communications with their tech team about the limitations of
their framework and how they could improve their A/B testing.

If you are looking to do anything beyond incredibly basic A/B testing, I would
look elsewhere. We ended up at SendGrid, though my recent issues with their
billing prevent me from recommending them either.

~~~
Silhouette
I don't know whether this is still the case, but last time I looked into
MailChimp they appeared extremely conservative/paranoid regarding spam. There
seemed to be dire warnings all over the place about how anyone doing anything
they didn't like would get banned. I wonder whether they once got stuck on a
lot of blacklists after a bad customer abused their system, maybe some time in
their early days when they were trying to establish credibility, and this
(perhaps understandably) led to a policy of shooting first and not asking a
lot of questions even later.

------
noelwelsh
Hmmm... in general this method of testing (sending to a fraction of the list)
is underpowered and it seems like they also encourage early stopping. It's
standard practice for sure, but also bad practice in terms of getting accurate
results.

Would say more but I don't have time.

~~~
SatvikBeri
Ravi Parikh from Heap goes into detail about why early stopping can screw up
your test results:
http://data.heapanalytics.com/dont-stop-your-ab-tests-part-way-through/

~~~
beejiu
Well, running your test on a fixed sample and then sending to the other
80% is not an example of early stopping. If you want to send 8,000 emails and
conduct a test on the first 1,000, that is not early stopping. It would
be early stopping if you looked at the preliminary results and decided to stop
at 500 instead. Otherwise, the statistical interpretation is sound. There are
statistical methods to deal with early stopping if you don't want to complete
tests on huge samples (e.g. sequential probability ratio tests), which may be
suitable in some cases.
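
A quick simulation makes the difference concrete. This is a minimal sketch
(mine, not from the thread): both arms share the same true open rate, so every
declared "winner" is a false positive. One look at the fixed sample size keeps
the error rate near the nominal 5%; peeking ten times along the way inflates
it well beyond that.

```python
import random

N = 1000          # emails per arm
PEEKS = 10        # interim looks for the peeking strategy
Z_CRIT = 1.96     # two-sided 5% threshold
TRUE_RATE = 0.2   # identical open rate in both arms
TRIALS = 2000

def z_stat(opens_a, opens_b, n):
    # Pooled two-proportion z statistic for equal arm sizes.
    pa, pb = opens_a / n, opens_b / n
    p = (opens_a + opens_b) / (2 * n)
    se = (2 * p * (1 - p) / n) ** 0.5
    return 0.0 if se == 0 else abs(pa - pb) / se

def declares_winner(peek):
    a = b = sent = 0
    checkpoints = ([N * k // PEEKS for k in range(1, PEEKS + 1)]
                   if peek else [N])
    for cp in checkpoints:
        while sent < cp:
            a += random.random() < TRUE_RATE
            b += random.random() < TRUE_RATE
            sent += 1
        if z_stat(a, b, sent) > Z_CRIT:
            return True   # spurious "significant" result
    return False

random.seed(0)
fixed = sum(declares_winner(False) for _ in range(TRIALS)) / TRIALS
peeky = sum(declares_winner(True) for _ in range(TRIALS)) / TRIALS
print(f"false-positive rate, fixed sample: {fixed:.1%}")   # about 5%
print(f"false-positive rate, {PEEKS} peeks:    {peeky:.1%}")  # much higher
```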

~~~
noelwelsh
With email you don't know whether a user isn't going to open your email or
just hasn't opened it yet. So there is a survival analysis aspect to A/B
testing of email -- at any point in time you might get more information from a
fixed sample just by waiting.

This definitely affects the power of tests. However, there is an early
stopping aspect in the decision to wait longer to collect more information, or
to stop the test. This is analogous to deciding to collect more samples or
stop a test, which is classic early stopping. In both cases you're making a
decision to stop/continue without considering its impact on your error rate.

On closer reading I note that MailChimp by default avoids this issue, by
making a decision after a fixed period of time. However, they aren't
controlling for loss of power and so forth.
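
To illustrate the waiting point, here's a small sketch (my own, with made-up
numbers): each recipient who will ever open does so after a random delay, so
the open rate you measure depends entirely on when you look.

```python
import random

# Each eventual opener opens after a random delay; look too early and
# you measure only a fraction of the true open rate.
N = 10_000
EVENTUAL_OPEN_RATE = 0.25      # hypothetical
MEAN_DELAY_HOURS = 12.0        # hypothetical exponential open delay

random.seed(0)
open_times = [random.expovariate(1 / MEAN_DELAY_HOURS)
              for _ in range(N) if random.random() < EVENTUAL_OPEN_RATE]

for hours in (1, 4, 24, 72):
    seen = sum(t <= hours for t in open_times)
    print(f"after {hours:>2}h: observed open rate {seen / N:.3f}")
# By 72h the observed rate approaches the eventual 0.25. Stopping at
# 1h measures a small fraction of it -- the unopened emails are
# censored observations, which is exactly what survival analysis models.
```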

-----

Other points:

- With a finite population, such as a mailing list, minimising regret is a
better objective than statistical significance; there's a bandit-style sketch
of this after the list. (Want to know more? Sign up at bandits.mynaweb.com and
it will be covered in due course.)

- There is a multiple testing aspect to what MailChimp does, and I bet they
don't control for false discovery rate.

- Also some post-hoc reasoning.

- Statistics is hard, and it's easy to criticise others and hard to do well
oneself. I hope this post will be taken as constructive criticism.
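
To make the regret point concrete, here's a minimal Thompson sampling sketch
(my illustration, not MailChimp's method or Myna's): instead of a fixed 50/50
split followed by a significance test, each send goes to the variant that
currently samples highest from its posterior, so most of the finite list ends
up receiving the better subject line.

```python
import random

# Two subject lines with hypothetical open rates; Beta(1, 1) priors
# updated after every send. Traffic shifts towards the better variant
# as evidence accumulates, so fewer emails are "spent" on the loser
# than in a fixed 50/50 split -- i.e. regret is minimised.
TRUE_OPEN_RATES = [0.10, 0.14]   # variant 1 is genuinely better
LIST_SIZE = 10_000

opens = [0, 0]
sends = [0, 0]

random.seed(0)
for _ in range(LIST_SIZE):
    # Draw each variant's open rate from its Beta posterior and send
    # the next email using whichever variant sampled highest.
    samples = [random.betavariate(1 + opens[i], 1 + sends[i] - opens[i])
               for i in range(2)]
    arm = samples.index(max(samples))
    sends[arm] += 1
    opens[arm] += random.random() < TRUE_OPEN_RATES[arm]

for i in range(2):
    print(f"variant {i}: {sends[i]} sends, "
          f"open rate {opens[i] / max(1, sends[i]):.3f}")
```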

------
stevoski
I was hoping for some conclusions of what made for good emails...but instead
the article just told me that MailChimp offers A/B testing.

------
mfrommil
"We recommend testing only one difference between the A and B groups. When
there are several differences between test groups, it’s difficult to figure
out which change impacted your engagement."

A/B testing may be better for figuring out one specific aspect (time to send,
sender name), but wouldn't multivariate testing make sense if you were looking
for the best combination of those variables? E.g., if I wanted to find out
whether sending an email at 9pm, from a corporate account, with larger images
is the best combination?

I'm interested to hear others' opinions on the pros/cons.
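
For what it's worth, the full factorial version of that example is easy to lay
out. A sketch (the factor names are just the ones from the comment, not
anything MailChimp exposes); the trade-off is visible in the cell count:

```python
from itertools import product

# Full factorial design over the three hypothetical factors from the
# comment above: send time, sender, and image size.
factors = {
    "send_time": ["9am", "9pm"],
    "sender": ["personal", "corporate"],
    "images": ["small", "large"],
}

# Every combination becomes one test cell: 2 * 2 * 2 = 8 cells.
cells = [dict(zip(factors, combo)) for combo in product(*factors.values())]
for cell in cells:
    print(cell)

# The catch: each cell gets only list_size / 8 recipients, so detecting
# the same effect size needs a far bigger list than a one-variable A/B
# test -- unless you assume the factors don't interact and analyse main
# effects across cells instead.
```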

~~~
ronaldx
I read this as marketing bluff:

Testing one difference guarantees that one side will beat the other (whether
or not there is a true difference).

Multiple testing has a danger of the results appearing inconsistent/random
(again, whether or not there is a true difference).

Encouraging people to test one difference benefits MailChimp, for exactly the
reason that they say - the results will always appear to be relevant.

~~~
gus_massa
That’s why you must do at least a back-of-the-envelope calculation to see
whether the results are significant. If you send two sets of n mails, the
expected variation is something like sqrt(n) (I don’t remember now, probably
sqrt(n)/4 or sqrt(n)/2, or something like that). Ask a statistician for the
exact number.

One easy experiment is to do the null A/B test. Send two sets of identical
mails and analyze the difference between the results of the two sets. Usually
you will get more variation than most people naively expect.
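
For reference, the standard deviation of the number of opens among n mails
with open rate p is sqrt(n*p*(1-p)), which is at most sqrt(n)/2 (at p = 0.5).
Here's a minimal null A/B simulation along those lines, with a made-up open
rate:

```python
import random

# Null A/B (A/A) test: two identical sets of N mails, same true open
# rate, then look at how far apart the open counts land by chance alone.
N = 1000
OPEN_RATE = 0.2   # hypothetical true open rate, identical for both sets

def opens():
    return sum(random.random() < OPEN_RATE for _ in range(N))

random.seed(1)
diffs = [abs(opens() - opens()) for _ in range(1000)]

sd = (N * OPEN_RATE * (1 - OPEN_RATE)) ** 0.5
print(f"theoretical sd per set: {sd:.1f} opens")          # ~12.6
print(f"average observed gap:   {sum(diffs) / len(diffs):.1f} opens")
# Gaps of 15-20 opens (a 1.5-2 percentage point "difference" on a 20%
# baseline) are routine even though the two sets are identical.
```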

------
beejiu
How does a difference between 99.9% and 100% for successful deliveries show as
statistically significant? There's surely a bug there?

~~~
RA_Fisher
As n goes to infinity, all things become significant.
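
Quite possibly not a bug, in other words: with a large enough list even a 0.1%
gap is detectable. A quick sketch (assuming equal arm sizes and a Fisher exact
test, which is not necessarily what MailChimp computes):

```python
from scipy.stats import fisher_exact

# Hold the delivery rates fixed at 99.9% vs 100% and grow n: the
# p-value of the 2x2 test drops towards zero.
for n in (1_000, 10_000, 100_000):
    bounces = round(n * 0.001)          # arm A: 99.9% delivered
    table = [[n - bounces, bounces],
             [n, 0]]                    # arm B: 100% delivered
    _, p = fisher_exact(table)
    print(f"n = {n:>7}: {bounces} bounces vs 0, p = {p:.3g}")
```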

------
ritonlajoie
For some company (a two-person company), I'm using Mailchimp to send emails to
their 3,500 clients. The thing is, I don't really want all the mailing list
stuff.

Today, what are my options if I just want to send HTML emails through Outlook
to my 3,500 subscribers without being flagged as spam?

~~~
jdjb
Mailchimp offers Mandrill, which is a basic SMTP service.

~~~
dbpatterson
Mandrill is transactional, so it would be a no-go for sending a message to
3,500 clients. Mailchimp and Mandrill don't overlap.

~~~
ceejayoz
False.

http://help.mandrill.com/entries/22242948-What-is-Mandrill-

> Technically, though, you can send any legal, non-spam emails through
> Mandrill.

