We’re contacting you about an ongoing outage with the Mandrill app. This email provides background on what happened and how users are affected, what we’re doing to address the issue, and what’s next for our customers.
Mandrill uses a sharded Postgres setup as one of our main datastores. On Sunday, February 3, at 10:30pm EST, 1 of our 5 physical Postgres instances saw a significant spike in writes. The spike in writes triggered a Transaction ID Wraparound issue. When this occurs, writes are completely halted: the database puts itself into read-only mode until offline maintenance (known as vacuuming) can occur.
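To make the wraparound concrete: Postgres transaction IDs live in a 32-bit space compared circularly, so "older than" only works while fewer than about 2 billion unfrozen XIDs are outstanding. The sketch below (my own illustration in Python, not Postgres source code) mimics that modulo-2^32 comparison, which is why the server forces read-only mode before old rows could start looking like they're "in the future":

```python
# Hypothetical sketch (not Postgres source) of circular 32-bit XID
# comparison: an XID counts as "in the past" if it falls within the
# previous 2^31 IDs. Once ~2 billion unfrozen XIDs accumulate, old rows
# would flip to looking "in the future", so Postgres halts writes first.

XID_SPACE = 2 ** 32

def xid_precedes(a: int, b: int) -> bool:
    """True if XID a is logically older than XID b (modulo-2^32)."""
    diff = (a - b) % XID_SPACE
    # Equivalent to interpreting (a - b) as a signed 32-bit value < 0.
    return diff >= 2 ** 31

# A lower XID is correctly seen as older...
assert xid_precedes(100, 200)
# ...even across the numeric wraparound point:
assert xid_precedes(XID_SPACE - 10, 5)
```

Vacuuming "freezes" old tuples so their XIDs drop out of this comparison, which is the maintenance Mandrill is now forced to run all at once.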
The database is large—running the vacuum process takes a significant amount of time and resources, and there’s no clear way to track progress.
For users, the impact may include untracked opens, clicks, bounces, email sends, inbound email, webhook events, and more. Right now, it looks like the database outage is affecting up to 20% of our outbound volume as well as a majority of inbound email and webhooks.
What we’re doing to address this
We don’t have an estimated time for when the vacuum process and cleanup work will be complete. While we have a parallel set of tasks going to try to get the database back in working order, these efforts are also slow and difficult with a database of this size. We’re trying everything we can to finish this process as quickly as possible, but this could take several days, or longer. We hope to have more information and a timeline for resolution soon.
In the meantime, it’s possible that you may see errors related to sending and receiving emails. We’ll continue to update you on our progress by email and let you know as soon as these issues are fully resolved.
We apologize for the disruption to your business. Once the outage is resolved, we plan to offer refunds to all affected users. You don’t need to take any action at this time—we’ll share details in a follow-up email and will automatically credit your account.
Again, we’re sorry for the interruption and we hope to have good news to share soon.
And this more recent one by Robert Haas:
As Josh states at the end of the third post, the current best practices for dealing with this are really workarounds, and, as Robert states, it requires monitoring and management. Postgres is an amazing piece of software and managing this is doable, but IMHO this is one of Postgres' worst warts. It would be awesome if someone could donate some funding to improve this.
This is also a problem that gets far harder to fix once you've run into it. If you have sufficient transaction volume to potentially hit this, you need to monitor autovacuum and make adjustments early before you get close to the wraparound. If you don't, you suddenly have to perform all the vacuum work at once, blocking that table until it's done.
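The "monitor early" advice above can be sketched as a simple check. The catalog query uses Postgres's real `pg_database` catalog and `age(datfrozenxid)` function; the thresholds are my own illustrative choices (autovacuum_freeze_max_age defaults to 200 million, and Postgres stops accepting writes as the age approaches roughly 2 billion):

```python
# Early-warning sketch for XID wraparound. The SQL is a standard catalog
# query; the 200M/1B alert thresholds are assumptions for illustration.

XID_AGE_QUERY = """
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
"""

def wraparound_status(xid_age: int,
                      warn_at: int = 200_000_000,
                      critical_at: int = 1_000_000_000) -> str:
    """Classify a database's oldest-XID age for alerting."""
    if xid_age >= critical_at:
        return "critical"   # run an aggressive VACUUM now, off-peak or not
    if xid_age >= warn_at:
        return "warning"    # autovacuum is falling behind on this database
    return "ok"

assert wraparound_status(50_000_000) == "ok"
assert wraparound_status(300_000_000) == "warning"
assert wraparound_status(1_500_000_000) == "critical"
```

The point is exactly the comment's: a cron job running this query gives you months of warning, whereas waiting until the forced shutdown means doing all the vacuum work at once while the shard is read-only.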
This is not usually the purpose of sharding though. Having a replica of each node or each block of data (and a good failover system) is what would allow you to pull a node offline with no impact. Though it's worth pointing out, even if they had a replica of the node in this case, the replica would probably experience XID wraparound at the same time so that probably wouldn't help.
Sharding usually means partitioning the data so that different data goes to different nodes. In this case that's consistent with 20% of outbound emails being affected if 1 of the 5 shards is down.
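That 20% figure falls out directly if keys are hash-partitioned evenly. A minimal sketch (shard count and key scheme are assumptions, not Mandrill's actual setup):

```python
# Minimal hash-sharding sketch across 5 nodes. With one shard down,
# roughly 1/5 of keys fail to route, matching the "up to 20% of
# outbound volume" figure in the outage notice.
import hashlib

NUM_SHARDS = 5
DOWN_SHARDS = {3}  # hypothetical failed node

def shard_for(key: str) -> int:
    # Stable hash so the same key always maps to the same shard.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def is_routable(key: str) -> bool:
    return shard_for(key) not in DOWN_SHARDS

keys = [f"message-{i}" for i in range(10_000)]
affected = sum(1 for k in keys if not is_routable(k))
print(f"{affected / len(keys):.1%} of keys affected")  # roughly 20%
```

This also shows why sharding alone isn't a high-availability strategy: each key lives on exactly one shard, so losing a node loses that slice of traffic unless each shard also has a replica to fail over to.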
There are definitely some red flags with their usage though: ideally only 20% of inbound emails and events would be affected as well, but they said almost all of them are. And ideally you couldn't get into a situation to begin with where one shard gets way more writes than everything else. And of course ideally you're monitoring XIDs and can respond far enough in advance. I'd be interested to read a more detailed writeup, though based on some of the comments here about their lack of transparency it seems unlikely that one will be released.
Don't get me wrong--I love relational databases and they are amazing pieces of technology. But they are incredibly hard to "do right" at scale while maintaining availability SLAs.
I would appreciate if downvoters would explain their decision to downvote, so that if I'm incorrect then I could at least update my beliefs. My position is based on years of experience watching relational databases maintained by professional DBAs catastrophically fail in strange ways, and subsequently taking a long time to recover, causing complete blackouts. And having yet to see such failures in managed NoSQL DBs like DynamoDB.
Relational databases are tried and true and we have learned from the failures and have only made the technology better.
There are many use cases, from a data-modeling perspective, where a relational DB makes more sense than NoSQL, and you really have to understand the trade-offs of consistency and durability too. There will always be a place for both technologies, and it's not a question of either/or but rather what makes sense for your application in terms of not only system scalability but data scalability.
The fact that transaction ID wraparound is so well-known is itself a red flag--apparently a lot of people have run into this issue, and yet it keeps being an issue. The blast radius is very large and the recovery is painful, as shown here by Mandrill. You should think twice before accepting that risk if you value your uptime.
If you want to become an expert on all these pitfalls and caveats of running relational databases at scale, at the expense of your availability and customer satisfaction--then by all means continue using relational databases. For many use cases, there are better options with better failure resiliency and recovery stories.