Got this email just now. - - - - Hello, We’re contacting you about an ongoing ou...

MechanicalTwerk · on Feb 5, 2019

If you're looking for a good series of blog posts about xid wraparound in Postgres check out these posts by Josh Berkus:

http://www.databasesoup.com/2012/09/freezing-your-tuples-off...

http://www.databasesoup.com/2012/10/freezing-your-tuples-off...

http://www.databasesoup.com/2012/12/freezing-your-tuples-off...

And this more recent one by Robert Haas:

https://rhaas.blogspot.com/2018/01/the-state-of-vacuum.html

As Josh states at the end of the third post the current best practices for dealing with this are really workarounds and as Robert states it requires monitoring and management. Postgres is an amazing piece of software and managing this is doable but IMHO this is one of Postgres' worst warts. It would be awesome if someone could donate some funding to improve this.

fabian2k · on Feb 5, 2019

My admittedly very superficial understanding of this issue is that the most common way to run into the xid wraparound problem is tuning the autovacuum in the wrong direction. So you notice that vacuum is taking up a lot of your servers resources, and decrease the frequency. Or you notice that it can't really keep up, but don't tune it to be more aggressive or provide enough resources for it to do its job. Or you don't monitor this at all, which is a pretty bad idea if you do billions of transactions (with less you can't really hit this issue).

This is also a problem that gets far harder to fix once you've run into it. If you have sufficient transaction volume to potentially hit this, you need to monitor autovacuum and make adjustments early before you get close to the wraparound. If you don't, you suddenly have to perform all the vacuum work at once, blocking that table until it's done.

jmagoon · on Feb 5, 2019

Why does one shard being down remove all their inbound functionality? I'm struggling to understand the purpose of sharding if you can't pull a node offline and replace it while you deal with the wraparound. Is it part of postgres that if one shard has an issue, the entire cluster goes into read only mode?

dcosson · on Feb 5, 2019

> I'm struggling to understand the purpose of sharding if you can't pull a node offline and replace it while you deal with the wraparound.

This is not usually the purpose of sharding though. Having a replica of each node or each block of data (and a good failover system) is what would allow you to pull a node offline with no impact. Though it's worth pointing out, even if they had a replica of the node in this case, the replica would probably experience XID wraparound at the same time so that probably wouldn't help.

Sharding usually means partitioning the data so that different data goes to different nodes. In this case that's consistent with 20% of outbound emails being affected if 1 of the 5 shards is down.

There are definitely some red flags with their usage though, like ideally only 20% of inbound emails and events would be affected as well but they said almost all of them are. And ideally you couldn't get into a situation to begin with where you have one shard getting way more writes than everything else. And of course ideally you're monitoring XIDs and can respond enough in advance. I'd be interested to read a more detailed writeup, though based on some of the comments here about their lack of transparency it seems unlikely that one will be released.

kawsper · on Feb 5, 2019

Its also shared here: https://mailchi.mp/953cb079837a/important-information-about-...

hartator · on Feb 5, 2019

Ha! For once it’s not MongoDB but Postgres. I wonder why the sending is effected though. Can’t they run their service with an empty databse in the meantime?

mcintyre1994 · on Feb 5, 2019

Seems to possibly be the same issue as that discussed on another frontpage article: https://andreas.scherbaum.la/blog/archives/970-How-long-will... / https://news.ycombinator.com/item?id=19082944

nodesocket · on Feb 5, 2019

I use Mandrill and haven't received any status email from them.

cle · on Feb 5, 2019

If you care about scalability and availability simultaneously, I'm not sure in these modern times why you would use a relational database. When they fail, they fail catastrophically and are difficult to recover, as this failure event (and the never-ending stream of failure events posted to HN) demonstrates.

Don't get me wrong--I love relational databases and they are amazing pieces of technology. But they are incredibly hard to "do right" at scale while maintaining availability SLAs.

edit:

I would appreciate if downvoters would explain their decision to downvote, so that if I'm incorrect then I could at least update my beliefs. My position is based on years of experience watching relational databases maintained by professional DBAs catastrophically fail in strange ways, and subsequently taking a long time to recover, causing complete blackouts. And having yet to see such failures in managed NoSQL DBs like DynamoDB.

I_have_receipts · on Feb 5, 2019

https://aws.amazon.com/message/5467D2/

cle · on Feb 11, 2019

What is your point? It was a 6 hour brownout, not a 30+ hour blackout. It is very unlikely that this kind of outage will happen again for DynamoDB. How likely is someone else going to run into a transaction wrap around again? If it's such a well-known issue, then presumably it keeps happening to a lot of people.

badmadrad · on Feb 6, 2019

Transaction wrap around is very well known issue and easy to avoid with autovacumming.

Relational databases are tried and true and we have learned from the failures and have only made the technology better.

There are many use cases from data modeling perspective where a relational db makes more sense than a no sql and you really have to understand the trade offs of consistency and durability too. There will always be a place for both technologies and its not a question of either/or but rather what makes sense for your application in terms of not only system scalability but data scalability.

cle · on Feb 11, 2019

I'm not saying that you should never use relational databases. But if you are running at a large scale and have tight availability SLAs...then consider not using relational databases.

The fact that transaction wrap around is so well-known is itself a red flag--apparently a lot of people have run into this issue, and yet it keeps being an issue. The blast radius is very large and the recovery is painful, as shown here by Mandrill. You should think twice before accepting that risk if you value your uptime.

If you want to become an expert on all these pitfalls and caveats of running relational databases at scale, at the expense of your availability and customer satisfaction--then by all means continue using relational databases. For many use cases, there are better options with better failure resiliency and recovery stories.