

MailGun Outage Postmortem - bceagle
http://blog.mailgun.com/post/what-happened-yesterday-and-what-we-are-doing-about-it/

======
tedunangst
hmmm, reminds me of
http://www.twilio.com/blog/2013/07/billing-incident-post-mortem-breakdown-analysis-and-root-cause.html

If you have a software stack and it is going to do something that is not
idempotent (like billing customers or sending emails), you need a state
machine more complex than "not done" and "done". You need a "doing" state.
After service is restored and everything is running smoothly, you go through
all the tasks stuck in "doing" and decide whether to retry or abort, based on
other logs or an evaluation of the consequences of not acting vs double
acting. What you do not do is have your software just keep hammering away
until everything magically turns "done".
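
A minimal sketch of that idea in Python, to make the "doing" state concrete. All names here (TaskState, Task, run_task, stuck_tasks, resolve) are hypothetical, the ABORTED state is my own addition for the "abort" decision, and a real system would persist each state change durably before acting, which this in-memory version only assumes:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List


class TaskState(Enum):
    NOT_DONE = "not done"
    DOING = "doing"
    DONE = "done"
    ABORTED = "aborted"  # assumption: an explicit outcome for "decided not to retry"


@dataclass
class Task:
    task_id: str
    state: TaskState = TaskState.NOT_DONE


def run_task(task: Task, side_effect: Callable[[Task], None]) -> None:
    """Attempt a non-idempotent action, never retrying automatically."""
    if task.state is not TaskState.NOT_DONE:
        # DOING, DONE or ABORTED: do not hammer away at it again.
        return
    # Record intent before acting (assumed to be a durable write in practice).
    task.state = TaskState.DOING
    side_effect(task)  # if this raises or the process dies, the task stays in DOING
    task.state = TaskState.DONE


def stuck_tasks(tasks: Iterable[Task]) -> List[Task]:
    """Tasks to review against other logs once service is restored."""
    return [t for t in tasks if t.state is TaskState.DOING]


def resolve(task: Task, retry: bool) -> None:
    """Apply an explicit retry-or-abort decision to a task stuck in DOING."""
    if task.state is not TaskState.DOING:
        raise ValueError("only tasks stuck in DOING need a decision")
    task.state = TaskState.NOT_DONE if retry else TaskState.ABORTED
```

The point of the design is that a crash between "start" and "finish" leaves evidence behind: nothing in DOING gets re-run until someone (or some reconciliation job) calls resolve() with a deliberate choice.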

------
zedpm
I just got an email from them with a link to this; what's interesting is the
garbled name in the To: field of the email. The mail was sent to ">,
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"MyFirst
MyLast\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"
<" <myaddress@example.com>

That's not the sort of thing I'd expect from a company whose entire business
is email...

~~~
alexk
Yeah, that was my error actually. Our MIME parser worked just fine and escaped
the stuff I put into the To: field. Looks like I need some sleep now; not gonna
write the postmortem about this one though :-)

------
dreeves
Beeminder got bitten pretty bad by this yesterday, as one of the "small number
of customers" affected by the duplicates. Several users let us know they were
seeing quadruplicates. Oy.

We're still huge Mailgun fans though, and have been since they were just
starting out. We've certainly had worse crashes of ineptitude than this
ourselves.

Kudos for the thorough post-mortem!

~~~
alexk
Daniel,

Again, we are sorry this happened. Per our analysis, duplicates occurred
because monits were killing delivery processes in an emergency, so only a small
part of the overall traffic was affected.

That does not make it any better for customers that were receiving
duplicates/quadruplicates though.

------
eksith
Slightly off topic, but posts like this show, in stark detail, why it's a good
idea to turn off comments. Good on you for being straightforward though.

~~~
crystaln
Unless they deleted negative comments, the ones there seem largely
complimentary.

------
rglover
Shit happens. Still an awesome/easy to use service :)

