They are the opposite of a transparent organization - they are the GoDaddy of this business. (I was super pissed when they removed their status page... WTF.) We hit these errors a few months ago with our clients, and I put up a roadmap to move all of them off the terrible joke that Mandrill is in under a month. None of my clients regretted the move.
I really don't miss anything about them. The market has caught up, they've been stagnant for way too long without any changes whatsoever, and they're still charging a premium for a prototype-level quality of service.
If you're still using Mandrill in 2019, move off as soon as you can. I'm saying this in the public interest, as I've had enough of these jokers.
Edit: I built a Mandrill clone on SES and Phoenix/Elixir, hosted on Google App Engine (basically Google's managed hosting service). I moved everything there and it works pretty well; I'm planning to open source it at some point. The awesome thing about this is that I can work on specific features my clients want that aren't available in vanilla Mandrill.
I chose this route because I lost a lot of money compensating my clients for Mandrill's faults, and I figured the money I'd keep losing over time would roughly equal what it cost to build a Mandrill clone myself.
I was already angry enough when Mailchimp merged Mandrill into its main service.
Not surprising that free services would be abused, but was a little disappointing. I would assume the paid customers get servers with better reputation/history.
Not really surprising when some companies surely are using services like Mailchimp to send spam - depending on your definition of spam.
They have a status page. What does Mandrill and MailChimp combining have to do with SendGrid and Twilio?
Don't use Mandrill or Mailchimp for anything important.
Also, are you using Amazon SES with Google App Engine? Seems odd.
If you're not going to name where you moved your customers to, then who cares?
Mandrill stopped delivering email at 04:51 UTC on the 4th of February. Nobody knows whether the received emails are lost, what the cause is, or how soon anything will change. It took Mandrill over 9 hours to acknowledge the issue.
They are posting the same message over and over, and have now been silent for over 9 hours.
Mandrill also recently got rid of their status page. A few months ago, their API started returning nginx errors, and their status page looked like a Christmas tree; every reload would show different green/orange statuses. The certificate their MX servers use is invalid (wrong domain), which prevents email delivery from compliant Danish email servers.
Some things, like viewing email content, look broken for certain emails, which doesn't hurt the business... yet.
It's true, though, that this service is nearing death, as it hasn't had an update to any of its features since its merger with Mailchimp.
We're looking to change service providers as soon as we figure out how to render emails to HTML before sending. Mandrill has a `render` endpoint which makes this easy. None of the others have this yet.
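In case it helps anyone evaluating replacements, here's a rough sketch of the kind of call we depend on today; the API key, template name, and merge vars are just placeholders:

    import requests

    # Mandrill's templates/render endpoint takes a stored template plus merge vars
    # and returns the rendered HTML without sending anything.
    resp = requests.post(
        "https://mandrillapp.com/api/1.0/templates/render.json",
        json={
            "key": "YOUR_MANDRILL_API_KEY",           # placeholder
            "template_name": "order-confirmation",     # hypothetical template
            "template_content": [],
            "merge_vars": [
                {"name": "FNAME", "content": "Jane"},
                {"name": "ORDER_ID", "content": "12345"},
            ],
        },
        timeout=10,
    )
    html = resp.json()["html"]  # rendered HTML, ready to archive or send elsewhere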
The assignment I had was to write, coordinate translations, produce and send a quarterly newsletter with anywhere between 20 and 35 translations per issue + optional separate content to be included on a country-to-country basis. Production time from drafted newsletter to final send was about 2 weeks.
The Foundation tool felt to me a little like Jekyll for email. I had a lot of fun with it.
Disclaimer: I have no connection to Foundation. Just a delighted user.
You can't trust their interface during these outages; we have been there before. Even though their Outbound page says "Delivered", the email isn't actually sent unless there are SMTP events attached. If it looks like this, your email is queued, not sent: https://cdn.servnice.com/screenie/c1Og3TcrkLFV9Hg.jpg
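If you don't want to rely on the dashboard at all, you can ask the API for the SMTP events directly; a minimal sketch (the API key and message ID are placeholders):

    import requests

    # messages/info returns metadata for a single message, including smtp_events.
    # An empty smtp_events list means Mandrill hasn't actually handed it off yet.
    resp = requests.post(
        "https://mandrillapp.com/api/1.0/messages/info.json",
        json={"key": "YOUR_MANDRILL_API_KEY", "id": "MESSAGE_ID"},  # placeholders
        timeout=10,
    )
    info = resp.json()
    if not info.get("smtp_events"):
        print("queued, not actually delivered yet; state =", info.get("state"))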
Klaviyo provides similar functionality: https://www.klaviyo.com/docs/api/email-templates
Disclaimer: I work at Klaviyo.
Campaign Monitor have the ability to take an API call and render into HTML using templates, this might do what you want: https://www.campaignmonitor.com/features/transactional-email...
(I'm unaffiliated with CM, just a customer)
I'm afraid I don't have any answer for you, but I'm guessing quite a few people will be moving to SendGrid - I'm not sure how I'd be able to trust a supplier after this lack of response to an outage.
 - https://sendgrid.com/use-cases/transactional-email/
We're definitely a lot happier from that perspective (and obviously given this issue I'm glad we moved), and overall it's really good. A couple of things to be aware of:
- Unlike Mandrill you can't send with a test API key and then view the emails. We end up sending real emails to a test inbox, which has some pros and cons - but cons include it costing email sends, and Gmail being awkward and silently de-duping sometimes.
- You also can't view sent email content at all (beyond the subject). With Mandrill you can; SendGrid doesn't seem to save it.
- Mandrill/others probably have the same issue, but SendGrid without your own IP sends emails to new recipients extremely slowly. And they lock your whole account while they're doing it: password resets won't send even to existing recipients, and neither will test emails to your test inbox.
- You also only get 3 days of email send history; you can pay for up to 30, I think. I'm not sure what the rule was with Mandrill, but they seemed to keep a lot more available, though their search would time out most of the time, so I'm not sure how much was really there and accessible.
Any email server which validates that the remote host name matches the cert will be very sad. Doesn't matter if it's compliant (to what, btw?); in practice email server owners are sloppy.
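To make the failure mode concrete, here's a rough sketch of what a strictly validating sender does during STARTTLS; the MX hostname is a placeholder, not Mandrill's actual host, and most server operators skip this check, which is the sloppiness I mean:

    import smtplib
    import ssl

    # A strictly "compliant" sender: verify the MX server's certificate and
    # hostname during STARTTLS before handing over any mail.
    ctx = ssl.create_default_context()  # check_hostname=True, certs verified

    with smtplib.SMTP("mx.example.com", 25, timeout=30) as smtp:  # placeholder host
        smtp.starttls(context=ctx)  # raises ssl.SSLCertVerificationError if the
                                    # cert is issued for the wrong domain
        smtp.noop()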
- - - -
We’re contacting you about an ongoing outage with the Mandrill app. This email provides background on what happened and how users are affected, what we’re doing to address the issue, and what’s next for our customers.
Mandrill uses a sharded Postgres setup as one of our main datastores. On Sunday, February 3, at 10:30pm EST, 1 of our 5 physical Postgres instances saw a significant spike in writes. The spike in writes triggered a Transaction ID Wraparound issue. When this occurs, database activity is completely halted. The database sets itself in read-only mode until offline maintenance (known as vacuuming) can occur.
The database is large—running the vacuum process takes a significant amount of time and resources, and there’s no clear way to track progress.
The impact to users could come in the form of not tracking opens, clicks, bounces, email sends, inbound email, webhook events, and more. Right now, it looks like the database outage is affecting up to 20% of our outbound volume as well as a majority of inbound email and webhooks.
What we’re doing to address this
We don’t have an estimated time for when the vacuum process and cleanup work will be complete. While we have a parallel set of tasks going to try to get the database back in working order, these efforts are also slow and difficult with a database of this size. We’re trying everything we can to finish this process as quickly as possible, but this could take several days, or longer. We hope to have more information and a timeline for resolution soon.
In the meantime, it’s possible that you may see errors related to sending and receiving emails. We’ll continue to update you on our progress by email and let you know as soon as these issues are fully resolved.
We apologize for the disruption to your business. Once the outage is resolved, we plan to offer refunds to all affected users. You don’t need to take any action at this time—we’ll share details in a follow-up email and will automatically credit your account.
Again, we’re sorry for the interruption and we hope to have good news to share soon.
And this more recent one by Robert Haas:
As Josh states at the end of the third post, the current best practices for dealing with this are really workarounds, and as Robert states, it requires monitoring and management. Postgres is an amazing piece of software and managing this is doable, but IMHO this is one of Postgres' worst warts. It would be awesome if someone could donate some funding to improve this.
This is also a problem that gets far harder to fix once you've run into it. If you have sufficient transaction volume to potentially hit this, you need to monitor autovacuum and make adjustments early before you get close to the wraparound. If you don't, you suddenly have to perform all the vacuum work at once, blocking that table until it's done.
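The monitoring side is cheap to set up; a minimal sketch of the standard age(datfrozenxid) check (the connection string and alert threshold are placeholders):

    import psycopg2

    # Alert well before the ~2 billion XID hard limit; autovacuum's default
    # autovacuum_freeze_max_age is 200 million, so pick a threshold above that.
    WARN_AGE = 500_000_000  # placeholder threshold

    conn = psycopg2.connect("dbname=postgres")  # placeholder DSN
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT datname, age(datfrozenxid) AS xid_age "
            "FROM pg_database ORDER BY xid_age DESC"
        )
        for datname, xid_age in cur.fetchall():
            if xid_age > WARN_AGE:
                print(f"WARNING: {datname} xid age {xid_age}, schedule a vacuum now")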
This is not usually the purpose of sharding though. Having a replica of each node or each block of data (and a good failover system) is what would allow you to pull a node offline with no impact. Though it's worth pointing out that even if they had a replica of the node in this case, the replica would probably have hit XID wraparound at the same time, so that wouldn't have helped.
Sharding usually means partitioning the data so that different data goes to different nodes. In this case that's consistent with 20% of outbound emails being affected if 1 of the 5 shards is down.
There are definitely some red flags with their usage, though. Ideally, only ~20% of inbound emails and events would be affected as well, but they said almost all of them are. Ideally you also couldn't get into a situation where one shard receives far more writes than the others. And of course, ideally you're monitoring XIDs and can respond well in advance. I'd be interested to read a more detailed writeup, though based on some of the comments here about their lack of transparency, it seems unlikely that one will be released.
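To spell out what I mean by sharding and why 1-of-5 down lines up with ~20%: with plain hash partitioning, roughly a fifth of keys land on any given node. A toy sketch (the shard names and hashing scheme are made up, not anything Mandrill has described):

    import hashlib

    SHARDS = ["pg0", "pg1", "pg2", "pg3", "pg4"]  # 5 hypothetical shard nodes

    def shard_for(key: str) -> str:
        # Stable hash so the same sender/message always routes to the same shard.
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return SHARDS[h % len(SHARDS)]

    # If pg3 goes read-only, only the ~20% of keys that hash to pg3 are affected;
    # the other shards keep accepting writes.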
Don't get me wrong--I love relational databases and they are amazing pieces of technology. But they are incredibly hard to "do right" at scale while maintaining availability SLAs.
I would appreciate if downvoters would explain their decision to downvote, so that if I'm incorrect then I could at least update my beliefs. My position is based on years of experience watching relational databases maintained by professional DBAs catastrophically fail in strange ways, and subsequently taking a long time to recover, causing complete blackouts. And having yet to see such failures in managed NoSQL DBs like DynamoDB.
Relational databases are tried and true and we have learned from the failures and have only made the technology better.
There are many use cases, from a data modeling perspective, where a relational DB makes more sense than NoSQL, and you really have to understand the trade-offs of consistency and durability too. There will always be a place for both technologies, and it's not a question of either/or but rather what makes sense for your application in terms of not only system scalability but data scalability.
The fact that transaction ID wraparound is so well-known is itself a red flag--apparently a lot of people have run into this issue, and yet it keeps being an issue. The blast radius is very large and the recovery is painful, as shown here by Mandrill. You should think twice before accepting that risk if you value your uptime.
If you want to become an expert on all these pitfalls and caveats of running relational databases at scale, at the expense of your availability and customer satisfaction--then by all means continue using relational databases. For many use cases, there are better options with better failure resiliency and recovery stories.
Some folks on the email thread were personally involved in handling major outages in the early days. We've had to learn a lot of hard lessons since then. Even when everything seems like it's going fine ("wow, we're growing so fast, good problems!"), scaling issues could be right around the corner.
Anyone with a large enough installation of Postgres could've had the wraparound issue we're seeing right now. That's why it's important to monitor for what could go wrong, detect these issues early, and provide customers with rapid communication so they can plan around it.
Sending our best wishes to the MailChimp engineering teams working on the problem right now. Good luck, you've got this!
(Worth noting that I'm a happy long-time Mailchimp customer.)
oh, so normal operation then
The other things I found really frustrating about SES: templates had to be defined inline in a JSON file and then sent to SES via the AWS CLI. Since there is no online/visual editor, copy changes and the like required a developer to rebuild/sanitize/minify the template source and then update it via the AWS CLI.
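For reference, the template shape SES expects looks roughly like this whether you push it through the CLI or a small script; all the names below are placeholders:

    import boto3

    ses = boto3.client("ses", region_name="us-east-1")  # placeholder region

    # SES templates are just a name plus subject/HTML/text parts; with no visual
    # editor, every copy change means re-uploading the whole thing like this.
    ses.create_template(
        Template={
            "TemplateName": "weekly-newsletter",           # hypothetical name
            "SubjectPart": "This week at {{company}}",
            "HtmlPart": "<h1>Hello {{name}}</h1><p>...</p>",
            "TextPart": "Hello {{name}}",
        }
    )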
It also took _way_ longer than it should have to get our rate limit bumped up to a reasonable level. IIRC, it took about a week for my request to be processed (after submitting proof that we owned the domain, etc.), and it was only after a fit on Twitter that AWS Support followed up with me and escalated the issue.
If someone has different info let me know... we're migrating off Mailchimp and already have some of our newsletters on SES at work.
I'm kinda glad now they forced the issue.
HN coverage of that event: https://news.ycombinator.com/item?id=11203056
> The impact to users could come in the form of not tracking opens, clicks, bounces, email sends, inbound email, webhook events, and more. Right now, it looks like the database outage is affecting up to 20% of our outbound volume as well as a majority of inbound email and webhooks.
It makes it seem like the actual sending of emails was not affected, just "tracking". I landed on this thread because some of our reporting emails weren't sent. Can anyone confirm it affected the sending of mail too?
Luckily we had nothing critical running through Mandrill, but I feel sorry for those who did, given it hit right around CNY, when many people will be on holiday.
"Mandrill also recently got rid of their status page. A few months ago, their API has started returning nginx errors, and their status page looked like a Christmas tree; every reload would indicate different green/orange statuses. The certificate their MX servers uses is invalid (wrong domain), which prevents email delivery from compliant Danish email servers."
Source Code: https://github.com/niftylettuce/forward-email
We currently use Law Ruler to send out our emails and have had no issues in the last 2 years.
If anyone needs a new solution fast, I would reach out to their support (form on website).
TL;DR: One of their five Postgres clusters went into read-only mode due to a Transaction ID Wraparound issue. Restoring this can take up to several days (!).
I made the switch to mailclickconvert.com from Constant Contact and it's been much better.
Please provide proof when making claims like this. Infowars/Alex Jones doesn't count, they were universally blacklisted.
There's an annoying tendency to combine spam services with actual transaction reports. (A transaction report is "Your order has shipped and here is the tracking number", not "We have a new product") Because everybody blocks the spammers. Constant Contact went down that road, and now you can't reliably use Constant Contact for transactions.