
Mandrill has been down for over 30 hours with no explanation - slau
https://twitter.com/teotwaki/status/1092624972252618754
======
neya
Unfortunately this is the standard MailChimp way of doing things ever since
they screwed over paying customers and merged with Mandrill [1].

They are the opposite of a transparent organization - they are the GoDaddy of
this business. (I was super pissed once they removed their status page..WTF).
We hit these errors a few months ago with our clients and I put up a roadmap
to move all of them off the terrible joke that Mandrill is, in under a month.
None of my clients regretted the move.

I really don't miss anything about them. The market has caught up, while
they've been stagnant for way too long without any changes whatsoever and are
still charging a premium for a prototype-level quality of service.

If you're still using Mandrill in 2019, move out as soon as you can. I'm
saying this in the public interest, as I've had enough of these jokers.

[1] [http://www.dangrossman.info/2016/02/28/mandrills-betrayal/](http://www.dangrossman.info/2016/02/28/mandrills-betrayal/)

Edit: I built a Mandrill clone on SES and Phoenix/Elixir, hosted on Google
AppEngine, which is basically Google's managed hosting service. I moved
everything there and it works pretty well; I'm planning to open source it at
some point. The awesome thing about this is I can work on some specific
features my clients want that aren't available on vanilla Mandrill.

I chose this route because I lost a lot of money compensating my clients for
Mandrill's failures, so I calculated that the money I lost over time would
have been about the same as what I'd have spent building a Mandrill clone
myself.

~~~
buf
I moved to sendgrid when they announced mandrill's merger and have never been
happier.

~~~
cremp
Except now, sendgrid is merged with twilio; so... history repeating?

~~~
edoceo
No. Twilio has a status page :)

~~~
jjeaff
And twilio seems to be very serious about uptime. Not sure if it translated to
all the facets of their business, but phone services being down will lose you
customers quickly.

------
slau
Disclaimer: it's my tweet that is linked in this story, and I submitted it.
There's no news story as far as I know. Better links are appreciated.

Mandrill stopped delivering email at 04:51 UTC on the 4th of February. Nobody
knows whether the received emails are lost, what the cause is, how soon any
change will occur. It took Mandrill over 9 hours to acknowledge the issue.

They are posting the same message over and over, and have now been silent for
over 9 hours.

Mandrill also recently got rid of their status page. A few months ago, their
API started returning nginx errors, and their status page looked like a
Christmas tree; every reload would indicate different green/orange statuses.
The certificate their MX servers use is invalid (wrong domain), which
prevents email delivery from compliant Danish email servers.

~~~
adnans
We're currently using Mandrill to send outbound emails. Although we've had
issues with some API calls, all the outbound emails we sent have been
delivered.

Certain things like viewing email content look broken on certain emails which
doesn't hurt the business, yet...

It's true though that this service is nearing death, as it hasn't had an
update to any of its features since it merged with Mailchimp.

Looking to change service providers soon, as soon as we figure out how to
render emails to html before sending. Mandrill has a `render` endpoint which
makes this easy. None of the others have this yet.
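
For anyone in the same spot, here is a minimal sketch of rendering a template
to HTML locally before handing it to whichever provider you end up on. It
assumes Mailchimp-style `*|TAG|*` merge tags; `render_template` is a
hypothetical helper, not part of any provider's API, and it does not try to
reproduce everything Mandrill's `render` endpoint does.

```python
import html
import re

# Assumed syntax: Mailchimp-style merge tags such as *|FNAME|*.
MERGE_TAG = re.compile(r"\*\|([A-Za-z0-9_]+)\|\*")


def render_template(template, merge_vars):
    """Replace each *|TAG|* with its HTML-escaped value.

    Tag names are matched case-insensitively; tags without a value are left
    untouched so that missing data is easy to spot in the output.
    """
    values = {key.upper(): value for key, value in merge_vars.items()}

    def substitute(match):
        name = match.group(1).upper()
        if name not in values:
            return match.group(0)
        return html.escape(str(values[name]))

    return MERGE_TAG.sub(substitute, template)
```

Calling `render_template("<p>Hi *|FNAME|*</p>", {"fname": "Ada"})` gives you
the final HTML string, which can then be sent through any plain "send this
HTML" API.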

~~~
brobdingnagians
The weirdest UX problem I've found with Mandrill is that you can't use the
browser "CTRL-F" to search an email template unless the text is _currently_
visible. Not sure how they managed to mess up the search functionality for a
text box that is intended to have massive amounts of text in it. It makes it
harder to do a minor spot edit of a template than it should be. Things begin
to make sense now...

~~~
crisscrosscrash
I believe this is a performance optimization to limit what's actually in the
DOM for large documents. The code editor is fairly common and is also used by
Google Tag Manager, but it annoys the heck out of me too.
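
To illustrate the idea, here is a rough sketch of how a virtualized editor
might decide which lines to keep in the DOM. The function name, the pixel
units and the `overscan` parameter are assumptions for illustration, not taken
from any particular editor's source.

```python
def visible_window(scroll_top, viewport_height, line_height, total_lines, overscan=2):
    """Return (first, last) line indexes to render, with `last` exclusive.

    Only these lines exist in the DOM; everything else is plain data in
    memory, which is why the browser's own find-in-page cannot see it.
    """
    first = max(0, scroll_top // line_height - overscan)
    # Ceiling division: a partially visible last line still has to be drawn.
    bottom_line = -(-(scroll_top + viewport_height) // line_height)
    last = min(total_lines, bottom_line + overscan)
    return first, last
```

With a 400px viewport and 20px lines, a 10,000-line template only ever has a
couple of dozen line nodes rendered at once, so Ctrl-F has very little to
search through.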

------
GuyPostington
Got this email just now.

- - - -

Hello,

We’re contacting you about an ongoing outage with the Mandrill app. This email
provides background on what happened and how users are affected, what we’re
doing to address the issue, and what’s next for our customers.

What happened

Mandrill uses a sharded Postgres setup as one of our main
datastores. On Sunday, February 3, at 10:30pm EST, 1 of our 5 physical
Postgres instances saw a significant spike in writes. The spike in writes
triggered a Transaction ID Wraparound issue. When this occurs, database
activity is completely halted. The database sets itself in read-only mode
until offline maintenance (known as vacuuming) can occur.

The database is large—running the vacuum process takes a significant amount of
time and resources, and there’s no clear way to track progress.

Customer impact

The impact to users could come in the form of not tracking
opens, clicks, bounces, email sends, inbound email, webhook events, and more.
Right now, it looks like the database outage is affecting up to 20% of our
outbound volume as well as a majority of inbound email and webhooks.

What we’re doing to address this

We don’t have an estimated time for when the
vacuum process and cleanup work will be complete. While we have a parallel set
of tasks going to try to get the database back in working order, these efforts
are also slow and difficult with a database of this size. We’re trying
everything we can to finish this process as quickly as possible, but this
could take several days, or longer. We hope to have more information and a
timeline for resolution soon.

In the meantime, it’s possible that you may see errors related to sending and
receiving emails. We’ll continue to update you on our progress by email and
let you know as soon as these issues are fully resolved.

What’s next

We apologize for the disruption to your business. Once the outage
is resolved, we plan to offer refunds to all affected users. You don’t need to
take any action at this time—we’ll share details in a follow-up email and will
automatically credit your account.

Again, we’re sorry for the interruption and we hope to have good news to share
soon.

~~~
cle
If you care about scalability and availability simultaneously, I'm not sure in
these modern times why you would use a relational database. When they fail,
they fail catastrophically and are difficult to recover, as this failure event
(and the never-ending stream of failure events posted to HN) demonstrates.

Don't get me wrong--I love relational databases and they are amazing pieces of
technology. But they are incredibly hard to "do right" at scale while
maintaining availability SLAs.

edit:

I would appreciate if downvoters would explain their decision to downvote, so
that if I'm incorrect then I could at least update my beliefs. My position is
based on years of experience watching relational databases maintained by
professional DBAs catastrophically fail in strange ways, and subsequently
taking a long time to recover, causing complete blackouts. And having yet to
see such failures in managed NoSQL DBs like DynamoDB.

~~~
I_have_receipts
[https://aws.amazon.com/message/5467D2/](https://aws.amazon.com/message/5467D2/)

~~~
cle
What is your point? It was a 6 hour brownout, not a 30+ hour blackout. It is
very unlikely that this kind of outage will happen again for DynamoDB. How
likely is it that someone else will run into a transaction wraparound again? If
it's such a well-known issue, then presumably it keeps happening to a lot of
people.

------
kawsper
Just remember that you can't trust their interface: even though their Outbound
page says "Delivered", it isn't delivered unless there are SMTP events
attached. If it looks like this, your email is queued, not sent:
[https://cdn.servnice.com/screenie/c1Og3TcrkLFV9Hg.jpg](https://cdn.servnice.com/screenie/c1Og3TcrkLFV9Hg.jpg)
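
If you are checking this from code rather than the web UI, the same rule can
be sketched roughly as below. The `smtp_events` and `_id` field names are
assumptions about the shape of the message record you get back, so adjust them
to whatever your API responses actually contain.

```python
def looks_delivered(message):
    """True only if the record carries at least one SMTP event.

    A "sent"/"Delivered" state on its own is not treated as proof of delivery.
    """
    events = message.get("smtp_events") or []
    return len(events) > 0


def still_queued(messages):
    """Return the ids of messages that have no SMTP events attached yet."""
    return [m.get("_id") for m in messages if not looks_delivered(m)]
```

Anything returned by `still_queued` is a candidate for re-sending through
another provider once you give up waiting.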

~~~
chrismeller
Wait, what? So delivered doesn’t mean delivered... who thought this was
acceptable UX?

------
etjossem
A few of my teammates at SendGrid have been following the situation, and we
definitely feel for the engineers who are scrambling to fix the problem. It's
never fun to get paged, especially when the trust of your customers is on the
line.

Some folks on the email thread were personally involved in handling major
outages in the early days. We've had to learn a lot of hard lessons since
then. Even when everything seems like it's going fine ("wow, we're growing so
fast, good problems!"), scaling issues could be right around the corner.

Anyone with a large enough installation of Postgres could've had the
wraparound issue we're seeing right now. That's why it's important to monitor
for what could go wrong, detect these issues early, and provide customers with
rapid communication so they can plan around it.
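
As a rough illustration of the monitoring part (a sketch, not what any
particular team runs): Postgres exposes the age of each database's oldest
unfrozen transaction ID, and an alert can be driven off how much of the
roughly 2-billion XID space is left. The thresholds below are arbitrary
assumptions; pick your own.

```python
# Standard catalog query; run it with your usual Postgres client.
XID_AGE_QUERY = "SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database;"

# XIDs are 32-bit and compared circularly, so about 2^31 are usable
# before old rows would appear to be "in the future".
XID_WRAPAROUND_LIMIT = 2**31


def wraparound_headroom(xid_age):
    """Fraction of the XID space still available (1.0 = fresh, 0.0 = exhausted)."""
    return max(0.0, 1.0 - xid_age / XID_WRAPAROUND_LIMIT)


def alert_level(xid_age, warn_at=0.5, page_at=0.25):
    """Map an XID age to "ok", "warn" or "page" (thresholds are assumptions)."""
    headroom = wraparound_headroom(xid_age)
    if headroom <= page_at:
        return "page"
    if headroom <= warn_at:
        return "warn"
    return "ok"
```

Feeding each `xid_age` from the query into `alert_level` on a schedule gives
you days or weeks of warning instead of finding out when writes stop.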

Sending our best wishes to the MailChimp engineering teams working on the
problem right now. Good luck, you've got this!

------
a2tech
We moved over to SES a long time ago. Mandrill was basically the cheapest game
in town, but their terrible service coupled with Mailchimp essentially
abandoning the platform after their acquisition made us jump ship. I feel bad
for anyone still using their service.

~~~
throwaway2016a
Depends on your volume. Mandrill has a monthly fee, and thanks to
pay-as-you-go pricing SES is cheaper up to a decent volume for a lot of
startups sending just transactional email.

------
bvm
This is honestly the most poorly communicated downtime I've ever experienced.
Now, they give a non-update saying they're going to send an email (yes, email)
about it later today:
[https://twitter.com/mandrillapp/status/1092810757929086977](https://twitter.com/mandrillapp/status/1092810757929086977)

------
coleca
Sending #HugOps over to the Mandrill Ops team. As frustrating as it is to not
have a service you depend on be available, if you've been around long enough,
you've probably been on the other end of that kind of outage and know it isn't
fun, expected or malicious.

------
ericcholis
Things like this influence stakeholder decisions down the road. I've been
evaluating migrating to Mandrill for some time to give our designer more
control over transactional emails. Now, it's unlikely that I'll do so.

(Worth noting, that I'm a happy long-time Mailchimp customer)

~~~
shedside
We used Mandrill for some time, then moved to SendGrid, and just recently have
moved again to Postmark because we were seeing delivery issues. FWIW, Postmark
has been fantastic.

~~~
brandonmenc
Postmark is the best. Rock solid, and a singular focus on one thing. I've been
using it for years.

------
jdc0589
> MOST outbound emails are sending

oh, so normal operation then

~~~
jdc0589
for context: here's a _partial_ view of our outbound error rate over the past
60 hours (this is maybe ~30% of all total mandrill errors):
[https://puu.sh/CHDxb/729e52451f.png](https://puu.sh/CHDxb/729e52451f.png). We
see around a 2% failure rate in Mandrill requests on a normal day though; it's
bad enough that we do two immediate retries before allowing our queuing system
to start handling retries with backoff.
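
Roughly what that retry policy looks like, as a simplified sketch: in our
setup the backoff part lives in the queuing system, but here both phases are
folded into one hypothetical function, and the retry counts after the first
two and the delays are placeholder values.

```python
import time


def send_with_retries(send, immediate_retries=2, backoff_retries=3,
                      base_delay=1.0, sleep=time.sleep):
    """Call send(); retry immediately a few times, then with exponential backoff.

    `send` is any zero-argument callable that raises on failure. The last
    error is re-raised once every attempt has failed.
    """
    attempts = 1 + immediate_retries + backoff_retries
    last_error = None
    for attempt in range(attempts):
        # Negative while we are still in the first try / immediate retries.
        backoff_step = attempt - immediate_retries - 1
        if backoff_step >= 0:
            sleep(base_delay * 2 ** backoff_step)  # 1s, 2s, 4s, ...
        try:
            return send()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
    raise last_error
```

The `sleep` parameter is only there so the delays can be swapped out in tests.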

------
entity345
Oh boy, that twitter thread is brutal... It looks like Mandrill's staff has
gone MIA.

------
samat
I don't care about Mandrill (who uses it, anyway?), but I'm deeply concerned
about whether I should expect the same quality, incident response and attitude
to customers from the parent company, Mailchimp.

~~~
gfwhukku
That's what I was thinking. Who hasn't moved to Mailgun already?

~~~
samat
Why not AWS SES?

~~~
frereubu
AWS SES doesn't have things like reporting, bounce handling etc. out of the
box - you need to set up all of that yourself (or at least you did when I last
looked at it a few years ago).
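
To give a sense of the "set it up yourself" part, here is a minimal sketch of
the bounce-handling piece, assuming bounce notifications are delivered through
SNS and that you pass in the SES notification JSON (the `Message` field of the
SNS envelope). `hard_bounced_addresses` is a hypothetical helper; storing the
addresses and suppressing future sends is still up to you.

```python
import json


def hard_bounced_addresses(notification_json):
    """Return the recipient addresses of a permanent (hard) bounce.

    Any other notification type, and transient bounces, yield an empty list.
    """
    notification = json.loads(notification_json)
    if notification.get("notificationType") != "Bounce":
        return []
    bounce = notification.get("bounce", {})
    if bounce.get("bounceType") != "Permanent":
        return []
    return [r["emailAddress"] for r in bounce.get("bouncedRecipients", [])]
```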

~~~
ethagnawl
This was all still true ~6 months ago. Compared to Mailchimp and friends, SES
is _very_ low level.

The other things I found really frustrating about SES were: templates had to
be defined inline in a JSON file and then sent to SES via the AWS CLI. So,
since there is no online/visual editor, copy changes and the like required a
developer to rebuild/sanitize/minify the template source and then update it
via the AWS CLI.
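
For reference, the workflow looked roughly like the sketch below: a small
script builds the JSON file the CLI expects, and a developer pushes it.
`build_ses_template` and the file name are made up for illustration; the
`Template` / `TemplateName` / `SubjectPart` / `HtmlPart` / `TextPart` keys are
the ones SES uses for templates.

```python
import json


def build_ses_template(name, subject, html_body, text_body):
    """Return the JSON document for an SES template as a string."""
    return json.dumps(
        {
            "Template": {
                "TemplateName": name,
                "SubjectPart": subject,
                "HtmlPart": html_body,
                "TextPart": text_body,
            }
        },
        indent=2,
    )
```

You write the result to something like `welcome.json` and then run
`aws ses create-template --cli-input-json file://welcome.json` (or
`update-template` for later copy changes), which is exactly the step a
non-developer can't do on their own.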

It also took _way_ longer than it should have to have our rate limit bumped up
to a reasonable level. IIRC, it took ~one week for my request to be processed
(after submitting proof that we owned the domain, etc.) and it was only after
a fit on twitter that AWS Support followed up with me and escalated the issue.

~~~
mountainofdeath
Amazon SES is supposed to be strictly transactional email. Using it for
anything else isn't really the intended purpose.

~~~
super-serial
No it's not. Right on the homepage they say it can be used for marketing
emails including newsletters:
[https://aws.amazon.com/ses/](https://aws.amazon.com/ses/)

If someone has different info let me know... we're migrating off Mailchimp and
already have some of our newsletters on SES at work.

------
ceejayoz
We moved off Mandrill back in 2016 when they changed their TOS effective
_immediately_ for one of their major use cases (sending newsletters).

I'm kinda glad now they forced the issue.

HN coverage of that event:
[https://news.ycombinator.com/item?id=11203056](https://news.ycombinator.com/item?id=11203056)

------
ian0
The email from support (only sent a few hours ago) stated:

> The impact to users could come in the form of not tracking opens, clicks,
> bounces, email sends, inbound email, webhook events, and more. Right now, it
> looks like the database outage is affecting up to 20% of our outbound volume
> as well as a majority of inbound email and webhooks.

It makes it seem like the actual sending of emails was not affected, just
"tracking". I landed on this thread as some reporting emails of ours weren't
sent. Can anyone confirm it affected the sending of mail too?

Luckily we had nothing critical running through Mandrill, but I feel sorry for
those who did, given it hit right around CNY where many people will be on
holidays.

------
frereubu
There seem to be quite a few snarky comments about Mandrill here, but we've
been using it for a few years and until today we've been happy with the
service. I'd be interested to hear concrete reasons (apart from the current
outage!) why Mandrill isn't as good as other services for a use case where I
want reporting, bounce handling etc. as part of the service (e.g. not AWS
SES).

~~~
tyingq
This one seems constructive, concerning, outlines events prior to today, and
was posted before your post:

_"Mandrill also recently got rid of their status page. A few months ago,
their API started returning nginx errors, and their status page looked
like a Christmas tree; every reload would indicate different green/orange
statuses. The certificate their MX servers use is invalid (wrong domain),
which prevents email delivery from compliant Danish email servers."_

~~~
frereubu
That's true. None of those affect the way I use Mandrill, which is perhaps why
I didn't pay too much attention to it, but it doesn't paint a good picture for
sure.

------
niftylettuce
If you need a service to use for email forwarding, try
[https://forwardemail.net](https://forwardemail.net)

Source Code: [https://github.com/niftylettuce/forward-email](https://github.com/niftylettuce/forward-email)

------
jrockway
We use Mandrill to get inbound emails into our support ticketing system. This
incident has pushed me in the direction of switching to ZenDesk or
something... but according to their status page, they also lost inbound emails
yesterday. Coincidence or is it Mandrill all the way down?

~~~
f3r3nc
We have issues with both. Some emails do go out, some not. No inbound.

------
Dphilman
Wow, that's pretty bad. I'm sure a lot of companies were affected. No ETA on a
fix is really scary.

We currently use Law Ruler to send out our emails and have had no issues in
the last 2 years.

If anyone needs a new solution fast I would reach out to their support ( form
on website ).

------
Wouter33
Mailchimp just sent out an update by email:
[https://pastebin.com/6TN10AZB](https://pastebin.com/6TN10AZB)

TLDR: One of their five Postgres clusters went into read-only mode due to a
Transaction ID Wraparound issue. Restoring this can take up to several days
(!).

------
bevacqua
Thrilled that I moved to mailgun when they announced the merge of Mandrill
into MailChimp

------
bks
If you just need outbound email submitted via smtp try outboundsmtp.com (I run
it)

------
olegd1280
ESPs like Mailchimp, Constant Contact, and others always have some issues with
emails... mainly because of the contacts list.

I made the switch to mailclickconvert.com from Constant Contact and it's been
much better.

------
Animats
It's a spamming service, right? (Not, as one might think from the name, a gay
dating service.) Is it a big deal if it's down to anyone but the spammers?

There's an annoying tendency to combine spam services with actual transaction
reports. (A transaction report is "Your order has shipped and here is the
tracking number", not "We have a new product") Because everybody blocks the
spammers. Constant Contact went down that road, and now you can't reliably use
Constant Contact for transactions.

