Hacker News
Mandrill has been down for over 30 hours with no explanation (twitter.com)
290 points by slau 18 days ago | 110 comments

Unfortunately this is the standard MailChimp way of doing things ever since they screwed over paying customers and merged with Mandrill [1].

They are the opposite of a transparent organization - they are the GoDaddy of this business. (I was super pissed once they removed their status page..WTF). We hit these errors a few months ago with our clients and I put up a roadmap to move all of them off the terrible joke that Mandrill is, in under a month. None of my clients regretted the move.

I really don't miss anything about them and the market has caught up and they've been stagnant for way too long without any changes whatsoever and are still charging a premium for a prototype level quality of service.

If you're still using Mandrill in 2019, move out as soon as you can. I'm saying this with public interest as I had enough of these jokers.

[1] http://www.dangrossman.info/2016/02/28/mandrills-betrayal/

Edit: I built a Mandrill clone on SES and Phoenix/Elixir, hosted on Google AppEngine, which is basically Google's managed hosting service. I moved everything there and it works pretty well; I'm planning to open source it at some point. The awesome thing about this is I can work on specific features my clients want that aren't available on vanilla Mandrill.

I chose this route because I lost a lot of money compensating my clients for Mandrill's failures, and I calculated that the money I lost over time was about what it would have cost to build a Mandrill clone myself.

Move to mailgun https://www.mailgun.com/

I was already angry enough when mailchimp merged mandrill with its main service

Loved mailgun when I was using it a few years ago for some small campaigns. Unfortunately the free tier service had a big problem with server reputation. A noticeable amount of emails were spam-flagged or rejected by recipient systems.

Not surprising that free services would be abused, but was a little disappointing. I would assume the paid customers get servers with better reputation/history.

I had the same issue, sent an email to support and they moved us to a different node/cluster/whatever immediately. I’ve never had another problem on a free or paid tier.

This problem doesn't seem to be limited to free services or smaller companies. Mailing list stuff sent via Mailchimp often gets caught by my spam filter, and many Mailchimp IPs are on various blacklists used by spam filter software.

Not really surprising when some companies surely are using services like Mailchimp to send spam - depending on your definition of spam.

What do you recommend? I'm on their lowest cost tier and am seeing some of the issues you describe.

I moved to sendgrid when they announced mandrill's merger and have never been happier.

We moved to sendgrid around the same time. Our deliverability took a hit, and I'm not the keenest on the very narrow window of email data they keep... but no real shenanigans, very straightforward, and they handled the one significant outage of theirs I can remember well.

SendGrid used to have a free plan but things have changed since being acquired. I moved to SendGrid anyways though and use their free 25,000 emails / month on Azure even though our app is hosted elsewhere.


Funny - I was looking at SendGrid on Twitter and looks like they've had some issues themselves over the last few weeks: https://twitter.com/sendgrid_ops Did you notice anything, or were these very small outages? For all that, they've obviously been a good deal more communicative than Mandrill when they have had issues.

Their API has been very reliable for us. They do have mail delivery delays somewhat frequently (see [1]), but we rarely notice them. They are extremely communicative on their status page, at least.

[1]: http://status.sendgrid.com/history?page=1

Except now, sendgrid is merged with twilio; so... history repeating?

We use both Send Grid and Twilio. Twilio has been outstanding and their API is outstanding. I’m not worried about this merger.

No. Twilio has a status page :)

And twilio seems to be very serious about uptime. Not sure if it translated to all the facets of their business, but phone services being down will lose you customers quickly.


They have a status page. What does Mandrill and MailChimp combining have to do with SendGrid and Twilio?

100% agree. Beyond the change to TOS, they also quadrupled Mandrill prices with almost no notice forcing companies with high volume to scramble (or pay massive amounts).

Don't use Mandrill or Mailchimp for anything important.

Yep, I remember they even turned off the comments on their blog for that post. I have never seen a more arrogant, scammy company than them in this vertical.

yeah i just noticed the status page was removed - it used to actually be pretty good!

What an odd trend. If anything every company is clamoring to get a status page, not remove the one they have.

Where did you move them to?

I built a Mandrill clone based on SES, Google AppEngine, Phoenix/Elixir. I moved everything there and it works pretty well, planning to open source it at some point.

How about you just open source it as a blog post and let someone else clean it up? Heck, even blogging the architecture and the issues would help.

Also are you using Amazon ses with Google app engine? Seems odd.

Google doesn't offer a transactional email service.

Google have no equivalent to ses.

That sounds awesome! Please post a show hn when you do.

Sure, thank you :)

You might have a viable saas there. Check out email octopus and moonmail (also open source)

Thank you! Moon mail looks interesting. Will check them out in detail later as to what they're using.

svnpenn 17 days ago [flagged]

I really do hate posts like this.

If you're not going to name where you moved your customers to, then who cares?

Thanks for the feedback, I'll edit my post.

Disclaimer: it's my tweet that is linked in this story, and I submitted it. There's no news story as far as I know. Better links are appreciated.

Mandrill stopped delivering email at 04:51 UTC on the 4th of February. Nobody knows whether the received emails are lost, what the cause is, how soon any change will occur. It took Mandrill over 9 hours to acknowledge the issue.

They are posting the same message over and over, and have now been silent for over 9 hours.

Mandrill also recently got rid of their status page. A few months ago, their API started returning nginx errors, and their status page looked like a Christmas tree; every reload would indicate different green/orange statuses. The certificate their MX servers use is invalid (wrong domain), which prevents email delivery from compliant Danish email servers.

We're currently using Mandrill to send outbound emails. Although we've had issues with some API calls, all the outbound emails we sent have been delivered.

Certain things like viewing email content look broken on certain emails which doesn't hurt the business, yet...

It's true though that this service is nearing death, as it hasn't had an update to any of its features since it merged with Mailchimp.

Looking to change service providers soon, as soon as we figure out how to render emails to html before sending. Mandrill has a `render` endpoint which makes this easy. None of the others have this yet.

A few years ago when Mandrill dramatically increased prices there was a big exodus of HN users. I solicited pricing and transition info from a number of different providers and summarized it here[0].

[0] http://gabe.durazo.us/tech/hacker-news-mandrill-followup/

Can't recommend Foundation for Emails highly enough for rendering html. I found it super easy to use (Moustache templates) and very flexible for managing multiple versions (translations) of a newsletter.

The assignment I had was to write, coordinate translations, produce and send a quarterly newsletter with anywhere between 20 and 35 translations per issue + optional separate content to be included on a country-to-country basis. Production time from drafted newsletter to final send was about 2 weeks.

The Foundation tool felt to me a little like jekyll for email. I had a lot of fun with it.


Disclaimer: I have no connection to Foundation. Just a delighted user.

> all the outbound emails we sent have been delivered.

You can't trust their interface during these outages; we have been there before. Even though their Outbound page says "Delivered", it isn't delivered unless there are SMTP events attached. If it looks like this, your email is queued, not sent: https://cdn.servnice.com/screenie/c1Og3TcrkLFV9Hg.jpg

The weirdest UX problem I've found with Mandrill is that you can't use the browser "CTRL-F" to search an email template unless the text is _currently_ visible. Not sure how they managed to mess up the search functionality for a text box that is intended to have massive amounts of text in it. It makes it harder to do a minor spot edit of a template than it should be. Things begin to make sense now...

I believe this is a performance optimization to limit what's actually in the DOM for large documents. The code editor is fairly common and is also used by Google Tag Manager, but it annoys the heck out of me too.

We’ve been using https://MJML.io and Handlebars and owning the rendering on our end. It’s been a breeze.

Not that it's a great alternative, but SendGrid uses Handlebars templates so it's pretty easy to render it yourself using any Handlebars library. You could probably quite easily write an AWS Lambda or similar that fetches a template using their API, populates it with data you post and returns the HTML.
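A minimal sketch of the render-it-yourself idea from the comment above, assuming templates that use only plain `{{name}}` placeholders. A real setup would use an actual Handlebars library; the regex substitution here is a simplified stand-in (no helpers, blocks, or partials), and the `render` function and the sample template are hypothetical.

```python
import html
import re

# Matches simple {{ name }} placeholders only. Real Handlebars syntax
# (helpers, {{#if}} blocks, partials) is deliberately out of scope.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def render(template: str, data: dict) -> str:
    """Replace each {{name}} with the HTML-escaped value from data."""

    def substitute(match):
        key = match.group(1)
        if key not in data:
            # Fail loudly rather than send an email with a hole in it.
            raise KeyError(f"template variable not provided: {key}")
        return html.escape(str(data[key]))

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    # Hypothetical template text; in practice it would be fetched from
    # wherever the templates are stored.
    template = "<p>Hi {{ first_name }}, your order {{order_id}} has shipped.</p>"
    print(render(template, {"first_name": "Ada", "order_id": 1234}))
```

Owning this step means the resulting HTML can be handed to whichever provider is doing the sending, which is the portability several commenters here are after.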

> Mandrill has a `render` endpoint which makes this easy. None of the others have this yet.

Klaviyo provides similar functionality: https://www.klaviyo.com/docs/api/email-templates

Disclaimer: I work at Klaviyo.

Yes, outbound has been solid for us, too. The issue is inbound. We literally have hundreds if not thousands of customer support agents unable to work because of this.

I've had lots of issues with outbound. Bulk sends failing completely. Individual seems to work but is very slow.

> Mandrill has a `render` endpoint which makes this easy. None of the others have this yet.

Campaign Monitor have the ability to take an API call and render into HTML using templates, this might do what you want: https://www.campaignmonitor.com/features/transactional-email...

(I'm unaffiliated with CM, just a customer)

Take a look at MJML.

Already using MJML to design email templates and wrapping <mj-raw> tags around Mailchimp conditional tags like `*|IF:SOMETHING|*` .. `*|END:IF|*`. I guess it's time to render the html emails ourselves, merge with params and then use whatever other third party transactional email service.

Sounds pretty rubbish for you and their other customers :(

I'm afraid I don't have any answer for you, but I'm guessing quite a few people will be moving to sendgrid[1] - I'm not sure how I'd be able to trust a supplier after this lack of response to an outage.

[1] - https://sendgrid.com/use-cases/transactional-email/

Recently moved our company from Mandrill to SendGrid - mainly because we were having deliverability issues with Mandrill.

We're definitely a lot happier from that perspective (and obviously given this issue I'm glad we moved), and overall it's really good. A couple of things to be aware of:

- Unlike Mandrill you can't send with a test API key and then view the emails. We end up sending real emails to a test inbox, which has some pros and cons - but cons include it costing email sends, and Gmail being awkward and silently de-duping sometimes.

- You also can't view sent email content, at all (beyond the subject). Mandrill you can, Sendgrid don't seem to save it.

- Mandrill/others probably have the same issue, but SendGrid without your own IP sends emails to new recipients extremely slowly. And they lock your whole account while they're doing it: password resets won't send to existing recipients, and neither will test emails to your test inbox.

- You also only get 3 days of email send history, you can pay for up to 30 I think. I'm not sure what the rule was with Mandrill but they seemed to keep a lot more available, though their search would time out most of the time so I'm not sure how much was really there and accessible.

Mandrill is 30 days of send history with 90 days of stats.

We're actually moving to SES as we speak, and the move should be done in a few hours. It's something we had planned, but it just sucks that this is how a paid service gets terminated.

So it’s dead, then? At least those are not the signs of a healthy operation that just made a rectifiable mistake. Combined with the length of downtime, sounds really bad...

> which prevents email delivery from compliant Danish email servers.

Any email server which validates that the remote host name matches the cert will be very sad. Doesn't matter if it's compliant (to what btw?), in practice email server owners are sloppy.

The Danish authorities have apparently mandated TLS 1.2 when sending certain mails, due to their interpretation of GDPR. I don't have the details though.

Got this email just now.

- - - -


We’re contacting you about an ongoing outage with the Mandrill app. This email provides background on what happened and how users are affected, what we’re doing to address the issue, and what’s next for our customers.

What happened

Mandrill uses a sharded Postgres setup as one of our main datastores. On Sunday, February 3, at 10:30pm EST, 1 of our 5 physical Postgres instances saw a significant spike in writes. The spike in writes triggered a Transaction ID Wraparound issue. When this occurs, database activity is completely halted. The database sets itself in read-only mode until offline maintenance (known as vacuuming) can occur.

The database is large—running the vacuum process takes a significant amount of time and resources, and there’s no clear way to track progress.

Customer impact

The impact to users could come in the form of not tracking opens, clicks, bounces, email sends, inbound email, webhook events, and more. Right now, it looks like the database outage is affecting up to 20% of our outbound volume as well as a majority of inbound email and webhooks.

What we’re doing to address this

We don’t have an estimated time for when the vacuum process and cleanup work will be complete. While we have a parallel set of tasks going to try to get the database back in working order, these efforts are also slow and difficult with a database of this size. We’re trying everything we can to finish this process as quickly as possible, but this could take several days, or longer. We hope to have more information and a timeline for resolution soon.

In the meantime, it’s possible that you may see errors related to sending and receiving emails. We’ll continue to update you on our progress by email and let you know as soon as these issues are fully resolved.

What’s next

We apologize for the disruption to your business. Once the outage is resolved, we plan to offer refunds to all affected users. You don’t need to take any action at this time; we’ll share details in a follow-up email and will automatically credit your account.

Again, we’re sorry for the interruption and we hope to have good news to share soon.
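For readers unfamiliar with the Transaction ID Wraparound mentioned in the email, here is a rough sketch of the underlying arithmetic. PostgreSQL transaction IDs are 32-bit counters compared modulo 2^32, so from any given XID roughly 2 billion IDs count as "the past" and the rest as "the future". The function below is a simplified illustration, not PostgreSQL's actual code; it ignores the handful of reserved special XIDs.

```python
XID_SPACE = 2**32  # transaction IDs are 32-bit counters that wrap around
HORIZON = 2**31    # about 2 billion XIDs are treated as "older" than any XID


def xid_precedes(a: int, b: int) -> bool:
    """Return True if XID a is considered older than XID b.

    The comparison is circular: the difference is taken modulo 2**32 and
    read as a signed 32-bit number, so "older" only makes sense while the
    two XIDs are fewer than 2**31 transactions apart.
    """
    diff = (a - b) % XID_SPACE
    # The upper half of the range corresponds to a negative signed
    # difference, meaning a is behind b.
    return diff >= HORIZON


if __name__ == "__main__":
    row_xid = 100  # XID that wrote some old, never-frozen row

    # Shortly afterwards the row is correctly seen as being in the past.
    print(xid_precedes(row_xid, 5_000))

    # More than 2**31 transactions later the same row would compare as
    # being in the future, i.e. it would effectively vanish. Vacuum
    # "freezes" old rows to prevent this, and the server stops accepting
    # writes rather than let the counter get that far.
    much_later = (row_xid + HORIZON + 5) % XID_SPACE
    print(xid_precedes(row_xid, much_later))
```

This is why the database refuses writes until the vacuum finishes: letting the counter advance further would risk old rows silently disappearing.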

If you're looking for a good series of blog posts about xid wraparound in Postgres check out these posts by Josh Berkus:




And this more recent one by Robert Haas:


As Josh states at the end of the third post the current best practices for dealing with this are really workarounds and as Robert states it requires monitoring and management. Postgres is an amazing piece of software and managing this is doable but IMHO this is one of Postgres' worst warts. It would be awesome if someone could donate some funding to improve this.

My admittedly very superficial understanding of this issue is that the most common way to run into the xid wraparound problem is tuning the autovacuum in the wrong direction. So you notice that vacuum is taking up a lot of your server's resources, and decrease the frequency. Or you notice that it can't really keep up, but don't tune it to be more aggressive or provide enough resources for it to do its job. Or you don't monitor this at all, which is a pretty bad idea if you do billions of transactions (with fewer you can't really hit this issue).

This is also a problem that gets far harder to fix once you've run into it. If you have sufficient transaction volume to potentially hit this, you need to monitor autovacuum and make adjustments early before you get close to the wraparound. If you don't, you suddenly have to perform all the vacuum work at once, blocking that table until it's done.
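As a rough sketch of the kind of monitoring described above: `age(datfrozenxid)` on `pg_database` reports how many transactions old each database's oldest unfrozen XID is, and an alert can fire well before that age approaches the 2^31 ceiling. The thresholds and function names below are illustrative assumptions, not recommended values, and the rows are hard-coded so the example runs without a database.

```python
# Query a monitoring job could run against PostgreSQL. It is shown here as
# a string only; executing it needs a database driver, which is out of scope.
XID_AGE_QUERY = (
    "SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC;"
)

WRAPAROUND_LIMIT = 2**31  # XID age can never be allowed to reach this


def classify_xid_age(age, warn_at=500_000_000, critical_at=1_500_000_000):
    """Map an XID age to an alert level. Thresholds are assumed examples."""
    if age >= critical_at:
        return "critical"
    if age >= warn_at:
        return "warn"
    return "ok"


def report(rows):
    """Turn (datname, xid_age) rows into (datname, level, remaining) tuples."""
    return [
        (name, classify_xid_age(age), WRAPAROUND_LIMIT - age)
        for name, age in rows
    ]


if __name__ == "__main__":
    # Hypothetical query results: (database name, age of oldest unfrozen XID).
    sample_rows = [("events", 1_600_000_000), ("accounts", 120_000_000)]
    for name, level, remaining in report(sample_rows):
        print(f"{name}: {level}, about {remaining:,} transactions of headroom")
```

The point of alerting on headroom rather than on failure is the one made above: the vacuum work is far cheaper to do early than all at once after writes have stopped.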

Why does one shard being down remove all their inbound functionality? I'm struggling to understand the purpose of sharding if you can't pull a node offline and replace it while you deal with the wraparound. Is it part of postgres that if one shard has an issue, the entire cluster goes into read only mode?

> I'm struggling to understand the purpose of sharding if you can't pull a node offline and replace it while you deal with the wraparound.

This is not usually the purpose of sharding though. Having a replica of each node or each block of data (and a good failover system) is what would allow you to pull a node offline with no impact. Though it's worth pointing out, even if they had a replica of the node in this case, the replica would probably experience XID wraparound at the same time so that probably wouldn't help.

Sharding usually means partitioning the data so that different data goes to different nodes. In this case that's consistent with 20% of outbound emails being affected if 1 of the 5 shards is down.

There are definitely some red flags with their usage though, like ideally only 20% of inbound emails and events would be affected as well but they said almost all of them are. And ideally you couldn't get into a situation to begin with where you have one shard getting way more writes than everything else. And of course ideally you're monitoring XIDs and can respond enough in advance. I'd be interested to read a more detailed writeup, though based on some of the comments here about their lack of transparency it seems unlikely that one will be released.
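To make the 1-in-5 arithmetic concrete, here is a small sketch of hash-based sharding. Mandrill has not said what key or scheme they shard on, so hashing a made-up account ID across five shards is purely an assumption for illustration.

```python
import hashlib

NUM_SHARDS = 5  # the email mentions 5 physical Postgres instances


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard by hashing the key (hypothetical scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def affected_fraction(keys, down_shard: int) -> float:
    """Fraction of keys whose data lives on the shard that is down."""
    keys = list(keys)
    hit = sum(1 for key in keys if shard_for(key) == down_shard)
    return hit / len(keys)


if __name__ == "__main__":
    # Made-up account IDs; with an even hash, about one in five lands on
    # any single shard.
    accounts = [f"account-{i}" for i in range(10_000)]
    print(f"{affected_fraction(accounts, down_shard=0):.1%} of accounts affected")
```

With an even spread, losing one shard touches roughly a fifth of the keys, which matches the outbound figure. It does nothing to explain the inbound numbers, which suggests inbound processing depends on that one instance in some other way.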

Ha! For once it’s not MongoDB but Postgres. I wonder why the sending is affected though. Can’t they run their service with an empty database in the meantime?

Seems to possibly be the same issue as that discussed on another frontpage article: https://andreas.scherbaum.la/blog/archives/970-How-long-will... / https://news.ycombinator.com/item?id=19082944

I use Mandrill and haven't received any status email from them.

If you care about scalability and availability simultaneously, I'm not sure in these modern times why you would use a relational database. When they fail, they fail catastrophically and are difficult to recover, as this failure event (and the never-ending stream of failure events posted to HN) demonstrates.

Don't get me wrong--I love relational databases and they are amazing pieces of technology. But they are incredibly hard to "do right" at scale while maintaining availability SLAs.


I would appreciate if downvoters would explain their decision to downvote, so that if I'm incorrect then I could at least update my beliefs. My position is based on years of experience watching relational databases maintained by professional DBAs catastrophically fail in strange ways, and subsequently taking a long time to recover, causing complete blackouts. And having yet to see such failures in managed NoSQL DBs like DynamoDB.

What is your point? It was a 6 hour brownout, not a 30+ hour blackout. It is very unlikely that this kind of outage will happen again for DynamoDB. How likely is someone else to run into a transaction wraparound again? If it's such a well-known issue, then presumably it keeps happening to a lot of people.

Transaction wraparound is a very well-known issue and easy to avoid with autovacuuming.

Relational databases are tried and true and we have learned from the failures and have only made the technology better.

There are many use cases, from a data modeling perspective, where a relational db makes more sense than NoSQL, and you really have to understand the trade-offs of consistency and durability too. There will always be a place for both technologies, and it's not a question of either/or but rather what makes sense for your application in terms of not only system scalability but data scalability.

I'm not saying that you should never use relational databases. But if you are running at a large scale and have tight availability SLAs...then consider not using relational databases.

The fact that transaction wrap around is so well-known is itself a red flag--apparently a lot of people have run into this issue, and yet it keeps being an issue. The blast radius is very large and the recovery is painful, as shown here by Mandrill. You should think twice before accepting that risk if you value your uptime.

If you want to become an expert on all these pitfalls and caveats of running relational databases at scale, at the expense of your availability and customer satisfaction--then by all means continue using relational databases. For many use cases, there are better options with better failure resiliency and recovery stories.

Just remember that you can't trust their interface. Even though their Outbound page says "Delivered", it isn't delivered unless there are SMTP events attached. If it looks like this, your email is queued, not sent: https://cdn.servnice.com/screenie/c1Og3TcrkLFV9Hg.jpg

Wait, what? So delivered doesn’t mean delivered... who thought this was acceptable UX?

exactly! this was the main reason I moved away.

A few of my teammates at SendGrid have been following the situation, and we definitely feel for the engineers who are scrambling to fix the problem. It's never fun to get paged, especially when the trust of your customers is on the line.

Some folks on the email thread were personally involved in handling major outages in the early days. We've had to learn a lot of hard lessons since then. Even when everything seems like it's going fine ("wow, we're growing so fast, good problems!"), scaling issues could be right around the corner.

Anyone with a large enough installation of Postgres could've had the wraparound issue we're seeing right now. That's why it's important to monitor for what could go wrong, detect these issues early, and provide customers with rapid communication so they can plan around it.

Sending our best wishes to the MailChimp engineering teams working on the problem right now. Good luck, you've got this!

We moved over to SES a long time ago. Mandrill was basically the cheapest game in town, but their terrible service coupled with Mailchimp essentially abandoning the platform after their acquisition made us jump ship. I feel bad for anyone still using their service.

Depends on your volume. Mandrill has a monthly fee, and thanks to pay-as-you-go pricing, SES is cheaper up to a decent volume for a lot of startups sending just transactional email.

This is honestly the most poorly communicated downtime I've ever experienced. Now, they give a non-update saying they're going to send an email (yes, email) about it later today: https://twitter.com/mandrillapp/status/1092810757929086977

Sending #HugOps over to the Mandrill Ops team. As frustrating as it is to not have a service you depend on be available, if you've been around long enough, you've probably been on the other end of that kind of outage and know it isn't fun, expected or malicious.

Things like this influence stakeholder decisions down the road. I've been evaluating migrating to Mandrill for some time to give our designer more control over transactional emails. Now, it's unlikely that I'll do so.

(Worth noting, that I'm a happy long-time Mailchimp customer)

We used Mandrill for some time, then moved to SendGrid, and just recently have moved again to Postmark because we were seeing delivery issues. FWIW, Postmark has been fantastic.

Postmark is the best. Rock solid, and a singular focus on one thing. I've been using it for years.

Postmark is awesome!

> MOST outbound emails are sending

oh, so normal operation then

For context: here's a partial view of our outbound error rate over the past 60 hours (this is maybe ~30% of all total mandrill errors): https://puu.sh/CHDxb/729e52451f.png. We see around a 2% failure rate in Mandrill requests on a normal day, though; it's bad enough that we do two immediate retries before allowing our queuing system to start handling retries with backoff.
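The retry policy described above (two immediate retries, then retries with backoff) can be sketched roughly as follows. The commenter's setup hands the backoff phase to a separate queuing system; this sketch collapses both phases into one hypothetical function for brevity, and the retry counts and delays are assumed values.

```python
import time


def send_with_retries(send, immediate_retries=2, backoff_retries=3,
                      base_delay=1.0, sleep=time.sleep):
    """Call send(); retry immediately a few times, then with backoff.

    send is any zero-argument callable that raises on failure. The sleep
    function is injectable so the behaviour can be checked without waiting.
    """
    last_error = None

    # Phase 1: the first attempt plus the immediate retries, with no delay.
    for _ in range(1 + immediate_retries):
        try:
            return send()
        except Exception as exc:
            last_error = exc

    # Phase 2: exponential backoff (base_delay, then 2x, 4x, ...).
    for attempt in range(backoff_retries):
        sleep(base_delay * 2**attempt)
        try:
            return send()
        except Exception as exc:
            last_error = exc

    # Every attempt failed; surface the most recent error to the caller.
    raise last_error


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_send():
        # Stand-in for an API call that fails twice before succeeding.
        calls["count"] += 1
        if calls["count"] <= 2:
            raise RuntimeError("simulated API failure")
        return "sent"

    print(send_with_retries(flaky_send), "after", calls["count"], "attempts")
```

Immediate retries absorb the everyday transient failures cheaply, while the backoff phase avoids hammering a provider that is already struggling.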

Oh boy, that twitter thread is brutal... It looks like Mandrill's staff has gone MIA.

I don't care about Mandrill (who uses it, anyway?), but I'm deeply concerned about whether I should expect the same quality, incident response and attitude to customers from the parent company, Mailchimp.

You most definitely should expect this and much worse from MailChimp - see https://news.ycombinator.com/item?id=18715866.

What’s a good alternative to Mailchimp?

That's what I was thinking. Who hasn't moved to Mailgun already?

We've been on Mailgun for 5 years and they've been very reliable. We almost switched to Mandrill, thinking it was better for 'transactional' type emails but now I'm glad we did not! Their customer service is always responsive, too.

I just set up an account with mailgun. Pleasantly surprised at how easy it was.

Why not AWS SES?

AWS SES doesn't have things like reporting, bounce handling etc. out of the box - you need to set up all of that yourself (or at least you did when I last looked at it a few years ago).

This was all still true ~6 months ago. Compared to Mailchimp and friends, SES is _very_ low level.

The other things I found really frustrating about SES were: templates had to be defined inline in a JSON file and then sent to SES via the AWS CLI. So, since there is no online/visual editor, copy changes and the like required a developer to rebuild/sanitize/minify the template source and then update it via the AWS CLI.

It also took _way_ longer than it should have to have our rate limit bumped up to a reasonable level. IIRC, it took ~one week for my request to be processed (after submitting proof that we owned the domain, etc.) and it was only after a fit on twitter that AWS Support followed up with me and escalated the issue.
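For a sense of the SES template workflow described above, here is a small sketch that builds the JSON document the AWS CLI expects for `aws ses create-template --cli-input-json`. The helper name and the sample template content are made up for illustration; the commenter's actual build pipeline is not known.

```python
import json


def build_ses_template(name, subject, html_body, text_body):
    """Build the JSON structure for an SES email template.

    The HTML has to be embedded as a single JSON string, which is why a
    copy change means regenerating and re-uploading the whole document.
    """
    return {
        "Template": {
            "TemplateName": name,
            "SubjectPart": subject,
            "HtmlPart": html_body,
            "TextPart": text_body,
        }
    }


if __name__ == "__main__":
    # Hypothetical template; {{name}} is filled in by SES at send time.
    template = build_ses_template(
        name="welcome",
        subject="Welcome, {{name}}!",
        html_body="<h1>Hello {{name}}</h1>\n<p>Thanks for signing up.</p>",
        text_body="Hello {{name}},\nThanks for signing up.",
    )
    # Save this output to a file such as welcome.json, then upload with:
    #   aws ses create-template --cli-input-json file://welcome.json
    print(json.dumps(template, indent=2))
```

Scripting the JSON generation removes the manual escaping step, though it still leaves non-developers without an editor, which is the complaint above.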

Amazon SES is supposed to be strictly transactional email. Using it for anything else isn't really the intended purpose.

No it's not. Right on the homepage they say it can be used for marketing emails including newsletters: https://aws.amazon.com/ses/

If someone has different info let me know... we're migrating off Mailchimp and already have some of our newsletters on SES at work.

We moved off Mandrill back in 2016 when they changed their TOS effective immediately for one of their major use cases (sending newsletters).

I'm kinda glad now they forced the issue.

HN coverage of that event: https://news.ycombinator.com/item?id=11203056

The email from support (only sent a few hours ago) stated:

> The impact to users could come in the form of not tracking opens, clicks, bounces, email sends, inbound email, webhook events, and more. Right now, it looks like the database outage is affecting up to 20% of our outbound volume as well as a majority of inbound email and webhooks.

It makes it seem like the actual sending of emails was not affected, just "tracking". I landed on this thread as some reporting emails of ours weren't sent. Can anyone confirm it affected the sending of mail too?

Luckily we had nothing critical running through Mandrill, but I feel sorry for those who did, given it hit right around CNY where many people will be on holidays.

There seem to be quite a few snarky comments about Mandrill here, but we've been using it for a few years and until today we've been happy with the service. I'd be interested to hear concrete reasons (apart from the current outage!) why Mandrill isn't as good as other services for a use case where I want reporting, bounce handling etc. as part of the service (e.g. not AWS SES).

This one seems constructive, concerning, outlines events prior to today, and was posted before your post:

"Mandrill also recently got rid of their status page. A few months ago, their API started returning nginx errors, and their status page looked like a Christmas tree; every reload would indicate different green/orange statuses. The certificate their MX servers use is invalid (wrong domain), which prevents email delivery from compliant Danish email servers."

That's true. None of those affect the way I use Mandrill, which is perhaps why I didn't pay too much attention to it, but it doesn't paint a good picture for sure.

If you need a service to use for email forwarding, try https://forwardemail.net

Source Code: https://github.com/niftylettuce/forward-email

We use Mandrill to get inbound emails into our support ticketing system. This incident has pushed me in the direction of switching to ZenDesk or something... but according to their status page, they also lost inbound emails yesterday. Coincidence or is it Mandrill all the way down?

We have issues with both. Some emails do go out, some not. No inbound.

Wow, that's pretty bad. I'm sure a lot of companies were affected. No ETA on a fix is really scary.

We currently use Law Ruler to send out our emails and have had no issues in the last 2 years.

If anyone needs a new solution fast I would reach out to their support ( form on website ).

Mailchimp just sent out an update per mail: https://pastebin.com/6TN10AZB

TLDR: One of their five Postgres clusters went into read-only mode due to a Transaction ID Wraparound issue. Restoring this can take up to several days (!).

Thrilled that I moved to mailgun when they announced the merge of Mandrill into MailChimp

If you just need outbound email submitted via smtp try outboundsmtp.com (I run it)

ESPs like Mailchimp, Constant Contact, and others always have some issues with emails, mainly because of the contact lists.

I made the switch to mailclickconvert.com from Constant Contact and it's been much better.


> They are also known to remove accounts on the system based on these partisan politics.

Please provide proof when making claims like this. Infowars/Alex Jones doesn't count, they were universally blacklisted.

ROFL. Please provide proof they remove accounts based on partisan politics! Except for that case that they removed an account based on partisan politics!

And you think an isolated incident somehow proves a trend or is otherwise concrete evidence of repeat violations? There's zero evidence of a systemic problem.

It's a spamming service, right? (Not, as one might think from the name, a gay dating service.) Is it a big deal if it's down to anyone but the spammers?

There's an annoying tendency to combine spam services with actual transaction reports. (A transaction report is "Your order has shipped and here is the tracking number", not "We have a new product") Because everybody blocks the spammers. Constant Contact went down that road, and now you can't reliably use Constant Contact for transactions.
