Hacker News new | comments | show | ask | jobs | submit login
PostgreSQL HA cluster failure: a post-mortem (gocardless.com)
322 points by stevedomin 7 months ago | hide | past | web | favorite | 117 comments

I’ve been involved in various Postgres roles (developer on it, consultant for it for BigCo, using it at my startup now) for around 18 years.

I’ve never in that time seen a pacemaker/corosync/etc/etc configuration go well. Ever. I have seen corrupted DBs, fail overs for no reason, etc. The worst things always happen when the failover doesn’t go according to plan and someone accidentally nukes the DB at 2am.

The lesson I’ve taken from this is it’s better to have 15-20 minutes of downtime in the unlikely event a primary goes down and run a manual failover/takeover script then it is to rely on automation. pgbouncer makes this easy enough.

That said, there was a lot of bad luck involved in this incident.

Our PostgreSQL failover plan is very pedestrian, but it works well (even across datacenters). We run streaming replication off of the primary to a pair of replicas, with one replica in another datacenter. The primary write DB advertises a loopback IP out OSPF into a top-of-rack switch, where it's aggregated by BGP and distributed throughout our network. There's a health check script [0] running every 3 seconds that makes sure PG is happy and that it is still writeable.

If we want to failover (nothing automatic), we stop the primary (or it's already dead) and the route is withdrawn. An operator touches the recovery file on new primary, the health checker sees that, and the IP is announced back into the network. Yes, it's a "VIP", but it's one controlled by our operations team, not automation software. One nice things about this is that you can failover across datacenters (remember it's advertised into our network over BGP) without reconfiguring DNS or messing with application servers.

While the mechanisms are different, we do something very similar for MySQL with MHA. It's still an operator running scripts intentionally though (which is what we want).

I will definitely agree with you that manual operator intervention is better than automated failover.

[0] https://github.com/unixsurfer/anycast_healthchecker

I'd love to read more about this.

If you have the time and motivation to write a detailed post that would be much welcome by me and many others as well I am sure.

Sure! I can probably write something up over the holidays.

Pacemaker/Corosync is a good tool if you know exactly how it works. However, very few people know exactly how it works. This is a very old problem for that team if you check the mailing lists. If they offered a few "blessed" configurations or even an online config generator, I'm sure that people would have nicer things to say about it.

I've used it heartbeat/pacemaker/corosync with DRBD for over a decade and I'm pretty pleased with it...now. But it took a bit of trial by fire to get it right. Luckily I never lost any data and found the issues in testing. Which gets at the heart of any failover mechanism -- it's all useless unless you test it on a regular basis (just like your backups).

I'd generalize this to many kinds of failover setups, DB or non-DB. Practically each and every one I've seen is in itself less reliable than the thing it's covering for.

It amazes me that in 2017 HA for RDBs is still so brittle and complicated. I had hoped this problem would have been solved by now!

Although I guess that's what the guys at CockroachDB are trying to do (for PostgreSQL at least!)

I’ve never in that time seen a pacemaker/corosync/etc/etc configuration go well. Ever.

That mirrors my experience - PG, DRDB, anything else, this form of clustering is a complete joke. If it's important, it's worth spending the money on a grown-solution. Veritas has its flaws sure but in 20+ years on half a dozen different OS's it's never let me down when the balloon went up.

your findings are supported by Baron Schwartz as well (https://www.xaprb.com/blog/2012/09/17/is-automated-failover-... and others)

I think it's relevant that Baron's post mentions a Github incident, and Github has been doing automatic failover of MySQL masters for a while now: https://githubengineering.com/orchestrator-github/

This post is also very relevant to the topic: http://code.openark.org/blog/mysql/mysql-high-availability-t...

The post you link is still relevant to this discussion, but we can't ignore that 5 years have passed. It would be good to see 2017's Baron revisit the topic :)

How does pgbouncer make this process easy? Just because there are less connections to go to the final DB? I've also got a random question you might be able to answer (having a hard time googling)... when using pgsql in streaming replication mode, are created functions replicated/updated to everything else as well?

(just learning about postgres and saw an opportunity to ask someone in the know) Cheers!

One reason is that some applications do not properly support reconnecting to the database if the connection is lost. With pgbouncer that is not an issue. Another is to avoid having floating IPs or updating the DNS.

There's nothing wrong with "floating" IP's as long as it's done properly.

I'd assume it's because you can just reconfigure to point pgbouncer to redirect everything to the secondary, rather then having to update all the applications using pgbouncer. It centralizes the configuration of which database is active.

How much effort is required to reconfigure pgbouncer across hundreds of app containers vs moving a single IP? (not using pacemaker/corosync)

Zero. Update configuration in a central location. Push an update to configuration manager agent on each node (using a dead simple gossip protocol). The agent wakes up, pulls new configuration, writes it and reloads pgbouncer. Using an aggressive gossip protocol, the whole process shouldn't take more than 5 seconds on a large cluster, let alone couple of nodes.

That's a fair amount of moving parts. What if one of the agents misses the update and keeps using the old server? What happens during that period of 5 seconds where the hosts are switching? It doesn't sound like an atomic, instant change process.

It's gossip. There is no precise gossip protocol. I've just mentioned one approach. It's not difficult to modify it such that each node gets one message multiple times, to ensure everyone will see the changes. Centrally configuration management systems like Zookeeper/etcd,... will also suffer from the same problem: message propagation takes time. Even with floating IP, a couple of requests may fail until routing tables converge. So there is no atomic configuration switch in either case anyway. The argument was if floating IP is easier than pure software approach to deploy, which is not the case. both of them are equally challenging. One of them requires experienced network engineers, the other one requires experienced distributed system designer. In case of failure the effort to switch from failed node is zero in both approaches.

PgBouncer makes it easy to do a controlled failover without having to deal with floating IP's on your postgresql cluster.

> are created functions replicated/updated to everything else as well?


Note if those functions have an implementation compiled from C, you do need to install the .so on the standbys though.

You need the .so's on the replica's simply for the database structure, but afaik, it's only the on-disk format that's being synchronized, functions are not executed on the replicas. I'm not even sure this would throw errors if the .so's would be missing - but it will give problems when a replica is promoted to the new master.

I agree: it's only because you need them when you fail over.


My understanding is that the problem is not really with pacemaker/corosync. Those tools also are always consistent as ZK/etcd/Consul. There is also SONITH to make sure the node that goes down can't cause damage once it is back.

The problem is not these tools, but implementing what is the right thing to do during an outage or even properly detecting one (what happened with github). Your solution might work 99 cases out of 100 but that remaining 1 case might cause your data loss.

When there is a human required to do the switch it typically he/she can investigate what happened and make the right decision.

It's theoretically possible to have a foolproof solution that always works right, but that's extremely hard to implement, because you need to know in advance what kind of issues you will have, and if you miss something, that's one case where your tool might make a wrong decision.

well corosync/pacemaker is definitly not the same as zk/etcd/consul. STONITH is mostly a bad idea. Two node clusters are actually always a bad idea. Using a VIP is a bad idea, too. This is what I learned in the small scale and in the big scale it's even worse.

The problem in this topic was that they didn't understood corosync/pacemaker correctly. The syntax is akward and it's hard to configure. With consul + patroni they would have a way better architecture that could be way more understood. They would not need a VIP (it would work over DNS). They used archive_command to get a WAL file from the primary on a sync replica. This should NEVER be done, if archive_command did not returned with a sane status code (which in fact it probably did not). They did not read https://www.postgresql.org/docs/10/static/continuous-archivi... at all. Last but not least you should never use restore_command on a sync node when it doesn't need to (always check if master is alive/healty before doing it. Maybe even check how far behind you are)

patroni would've worked in their case. patroni would've made it easy to restart the failed primary. patroni would be in control of the postgresql which is way better than using pacemaker/corosync (especially combined with a watchdog/softdog).

what would've helped also would have been two sync nodes and fail to any of them. (will be harder since sync nodes need to be detached if unhealty)

and best thing is with etcd/consul/zk you could have a cluster of etcd/consul/zk on three different nodes than your 3 database servers (this helps a lot).

It's a little lost in another comment thread (https://news.ycombinator.com/item?id=15862584), but I'm definitely excited about solutions like Patroni and Stolon that have come along more recently.

Well you should definitly look into them. In the past we used corosync/pacemaker a lot (even for different things than just database-ha) but trust me... it was never a sane system. if it ain't broke it worked. if something broke it was horrible to actually get back to any sane state at all.

we migrated to patroni (yeah stolon is cool aswell, but since it's a little bit bigger than we need to we used patroni). the hardest part for patroni is actually creating a script which would create service files for consul (consul is a little bit wierd when it comes to services) or somehow changes dns/haproxy whatever to point to the new master (this is not a problem on stolon)

but since then we tried all sorts of failures and never had a problem. we pulled plugs (hard drive, network, power cord) nothing bad did happen no matter what we did. watchdog worked better than expected in some cases where we tried to fire bad stuff at patroni/overload it. and since it's in python the charactaristic/memory/cpu usage is well understood. (the code is also easy to reason about, at least better than corosync/pacemaker.) etcd/zk/consul is battle tested and did work even that we have way more network partitions than your typical network (this was bad for galera.. :(:() we never autostart a failed node after a restart/clean start. we always look into the node and manually start patroni. and also we use the role_change/etc hooks to create/delete service files in consul and to ping us if anything on the cluster happens.

I am currently using Stolon with synchronous replication for a setup, and overall it's great.

It gives me automated failover, and -- perhaps more imporatantly -- the opportunity to exercise it a lot: I can reboot single servers willy-nilly, and do so regularly (for security updates every couple days).

I picked the Stolon/Patroni approach over Corosync/Pacemaker because it seems simpler and more integrated; it fully "owns" the postgres processes and controls what they do, so I suspect there is less chance to accidentally mis-configurations in the fashion of what the article describes.

I currently prefer Stolon over Patroni because statically typed languages make it easier to have less bugs (Stolon is Go, Patroni is Python), and because the proxy it brings out of the box makes it convenient: On any machine I connect to localhost:5432 to get to postgres, and if the Postgres fails over, it ensures to disconnect me so that I'm not accidentally connected to a replica.

In general, the Stolon/Patroni approach feels like the "right way" (in absence of failover being built directly into the DB, which would be great to have in upstream postgres).


Bugs. While Stolon works great most of the time, every couple months I get some weird failure. In one case it was that a stolon-keeper would refuse to come back up with an error message, in another that a failover didn't happen, in a third that Consul stopped working (I suspect a Consul bug, the create-session endpoint hung even when used via plain curl) and as a result some stale Stolon state accidentally accumulated in the Consul KV store, with entries existing that should not be there and thus Stolon refusing to start correctly.

I suspect that, as with other distributed systems that are intrinsically hard to get right, the best way to get rid of these bugs is if more people use Stolon.

> I currently prefer Stolon over Patroni because statically typed languages make it easier to have less bugs (Stolon is Go, Patroni is Python)

Sounds like a holy-war topic :) But lets be serious. How statically typed language helps you to avoid bugs in algorithms you implement? The rest is about proper testing.

> and because the proxy it brings out of the box makes it convenient: On any machine I connect to localhost:5432 to get to postgres

It seems like you are running a single database cluster. When you'll have to run and support hundreds of them you will change your mind.

> if the Postgres fails over, it ensures to disconnect me so that I'm not accidentally connected to a replica.

HAProxy will do absolutely the same.

> Bugs. While Stolon works great most of the time, every couple months I get some weird failure. In one case it was that a stolon-keeper would refuse to come back up with an error message, in another that a failover didn't happen, in a third that Consul stopped working (I suspect a Consul bug, the create-session endpoint hung even when used via plain curl) and as a result some stale Stolon state accidentally accumulated in the Consul KV store, with entries existing that should not be there and thus Stolon refusing to start correctly.

Yeah, it proves one more time: * don't reinvent wheel: HAProxy vs stolon-proxy * using statically typed language doesn't really help you to have less bugs

> I suspect that, as with other distributed systems that are intrinsically hard to get right, the best way to get rid of these bugs is if more people use Stolon.

As I've already told before. We are running a few hundred Patroni clusters with etcd and a few dozen with ZooKeeper. Never had such strange problems.

> > if the Postgres fails over, it ensures to disconnect me so that I'm not accidentally connected to a replica.

> HAProxy will do absolutely the same.

well I think that is not the same what stolon-proxy actually provides. (actually I use patroni) but if your network gets split and you end up with two masters (one application writes to the old master) there would be a problem if one application would still be connected to the splitted master.

however I do not get the point, because etcd / consul would not allow to still hold the master role which means that the splitted master would lose the master role and thus either die, because it can not connect to the new master or just be a read slave and the application would than probably throw errors if users are still connected to the splitted application. highly depends how big your etcd/consul is and how good your application detects failures. (since we are highly dependent on our database we actually kill hikaricp (java) in case of too many master write failures and just restart it after a special amount of time. well we also look in creating a small lightweight async driver based on akka, where we do this in a little bit more automated fashion.)

> well I think that is not the same what stolon-proxy actually provides. (actually I use patroni) but if your network gets split and you end up with two masters (one application writes to the old master) there would be a problem if one application would still be connected to the splitted master.

On network partition Patroni will not be able to update leader key in Etcd and therefore restart postgres in read-only mode (create recovery.conf and restart). No writes will be possible.

it would be interesting to know how stolon/patroni deal with the failover edge cases and how this impacts availability. like if you accessing the DB but can't contact etcd/consul then you should stop accessing the DB because you might start doing unsafe writes. but this means that consul/etcd is now a point of failure (though, this usually runs multiple nodes so shouldn't happen!). but you can end up in a situation where bugs/issues with the HA system ends up causing you more downtime than manual failover would cause.

you also have to be careful with ensuring there is sufficient time gaps when failing over to cover the case when the master is not really down and connections are still writing to it. like the patroni default haproxy config doesn't even seem to kill live connections which seems kind of risky.

> if you accessing the DB but can't contact etcd/consul then you should stop accessing the DB because you might start doing unsafe writes.

If patroni can't update leader lock in Etcd, it will restart postgres in read-only mode. No writes will happen.

> like the patroni default haproxy config doesn't even seem to kill live connections which seems kind of risky.

That's not true: https://github.com/zalando/patroni/blob/master/haproxy.cfg#L...

Ah. Thanks. I was looking at the template files but I guess that is not used or used for something else.

Thanks for the extra info, and the insight into how you're using Patroni. Always helpful to hear about someone using it for real, especially someone who's come from Pacemaker. :)

Patrons is great. Running dockerised Postgres with consul backend for years without a hitch. Haproxy as lb. What that? A replica need reboot. Just reboot. Primary? Just failover to replica and reboot. Undesired reboots recovers in under 10 seconds. During which just primary is not available but replicas are.

>Fortunately, as part of some unrelated work we'd done recently, we had a version of the cluster that we could run inside Docker containers. We used it to help us build a script that mimicked the failures we saw in production. Being able to rapidly turn clusters up and down let us iterate on that script quickly, until we found a combination of events that broke the cluster in just the right way.

this is the coolest part of this story. Any chance these scripts are opensource ?

We plan to open source it as soon as we can. Tiny bit more work to do, then review from a couple of people in the team, then we make it public. :)

Thanks for this !

Since I am investigating HA with PostgreSQL right now and have bitter experience of Pacemaker 'HA' instances that have been anything but, I am looking at Amazon Aurora and Microsoft's (in preview) Azure database for PostgreSQL offerings. I would really appreciate any insight from others who are already using them (we intend to do some PoC work shortly).

Our dev team also came up with some pertinent questions, which we have put to both companies, but if anyone else can comment from experience that would be fantastic:

* Is the product a fork of PostgreSQL or a wrapper round the current version?

* Will the DB engine keep in lock-step with new PostgreSQL releases or might they diverge?

* If the DB engine keeps in lock-step, what’s the period between a new version of PostgreSQL being released before its incorporated in the live product?

* When new versions of Amazon Aurora/Azure DB for PostgreSQL are released will our live instance get automatically updated or will we be able to choose a version?

I can comment on Amazon Aurora:

> Is the product a fork of PostgreSQL or a wrapper round the current version?

Aurora is a fork: they've re-written a significant chunk of the engine. Note that Amazon also offers RDS PostgreSQL, which is a managed version of the "regular" PostgreSQL engine. RDS PostgreSQL also offers a HA setup (no version upgrade without downtime, however). It works quite well.

> Will the DB engine keep in lock-step with new PostgreSQL releases or might they diverge?

Amazon promises to keep it in lock-step. How soon they will release an upgrade to a major version remains to be seen.

> When new versions of Amazon Aurora/Azure DB for PostgreSQL are released will our live instance get automatically updated or will we be able to choose a version?

Minor version upgrades are applied automatically. For major version upgrades, it's unclear at this time (there hasn't been one yet for Aurora PostgreSQL), but I think it's unlikely they will be applied automatically.

I think the root issue is that PostgreSQL does not offer an HA solution that works out of the box with minimal configuration, resulting in people using broken third-party ones and/or configuring them incorrectly.

They should either provide one or "bless" an external solution as the official one (after making sure it works correctly).

The other problem is that GoCardless setup an asynchronous and a synchronous replica instead of 2 synchronous replicas (or preferably 4+), resulting in only two points of failure, which is not enough.

Pretty much number 1 on my Postgres wishlist.

I cannot upvote this enough... it's the single most glaring feature missing from PostgreSQL at this point. I know it's where all the support companies are making their money, but it really keeps the majority of SMBs from adopting postgres. As much as I hate mySQL.

Makes me sad that running your own instances is now an "elephant in the room." No pride in old-school do-it-yourself nerditry these days :/

Getting HA right is hard. DIY-ing it incurs risk, possibly deliberately, out of Not-Invented-Here-ism.

Source: PostgreSQL DBA for over a decade; have built multiple HA environments; have seen many ways of "doing it wrong", and how those can end up biting their creators.

On the other hand:

With hosted Postgres, when a failure does happen, isn't it much harder to get at the log files? They seem extremely useful to diagnose the problem and make sure it doesn't happen again, as the article shows. What's your experiene here, can you get at logs easily with hosted Posgres offerings?

And it seems the only way to get reliable Postgres HA for everyone, and to weed out the bugs, is if more people run Posgres HA themselves. For example, I find Stolon and Patroni great, but I would be more relaxed about them if they had 100x more users.

We aren't using hosted postgres (much, yet). We provision EC2 instances and self-manage it. Failover is scripted, and manually invoked as needed.

None of us trust any of the automated failover solutions enough to use them. We want human judgement in that loop, even if it means being woken at 3AM to push the button. It's that hard to get right.

Just one incident like The Fine Article's is well more than our entire infrastructure's total downtime for the rolling year, and we have hundreds of postgres instances.

Done wrong, automated failover is a net increase in risk. And, in case my thesis is somehow unclear, it's hard to get right.

It's not hard. The problem is operation team rarely exercise failures. Configure a test HA cluster in lab, and test it. If it works, push it to production. The production system, should be continuously tested with real failures to see whether fail over mechanisms actually taking over the responsibility or not.

OF COURSE every thing is going to work in the lab, BUT MAY BE there is some other corner case in the production that you haven't considered yet. --- Louis C.K. after a sleepless night of switching primary DB to secondaries.

It's still "respectable" to run your own n=1 Postgres instance, maybe with WAL-E backup. It's sensible, as well, to create your own read-replicas to scale OLAP; and even to do your own shared-nothing sharding across regions. These are all "set and forget" enough that they can be the responsibility of your regular devops.

But, when you get to the point where you need multi-master replication, you're making a mistake if you aren't dedicating an ops person (i.e. a DBA) to managing your growing cluster. If you can't afford that ops person, much better to just pay a DBaaS provider to handle the ops for you, than to get hosed when your cluster falls apart and your week gets shot putting it back together.


A thing that scares me is anyone saying they're running their own HA cluster (not single instance) for cost reasons. Infra people are not cheaper than the hosted solutions (Amazon RDS, Google Cloud SQL, Heroku Postgres).

That’s a blanket statement that has very little basis in reality. Hosted Postgres is never going to give you the performance you need for low latency deployments.

Oh for sure!

My claim is that you need to hire some expensive people if you want that performance, not that there aren't reasons to run your own database instances!

RDS i3 on bare-metal is in preview now, so it's not too far off


Then you are running it yourself for latency reasons, not (just) the cost reasons in the GP's scenario.

Cause AWS infra is managed by magic monkeys.

I love nerding out over this personally, but if you're a startup, given the plethora of well managed offerings, you're frankly foolish to invest resources on this. Even if you eventually reach the point where it makes financial sense to hire a full time ops person or DBA, the opportunity cost of having a smart engineer (and it does take a smart engineer to manage a multi-master database) work on infrastructure instead of your actual product, is just stupid.

How many startups have failed because they spent too much money building "cool, nerdy, infrastructure" instead of just building a product?

The danger here is differentiating between "the infrastructure" and "the product". Useful database and/or infrastructure work _is_ "building the product".

There's nothing necessarily wrong with using pre-baked or hosted components when they fit the bill, but pretending like they're unrelated concerns is going down a bad road. A lot of recent fads are based on this self-centered, lazy fantasy from devs that $LANGUAGE_OF_THE_MONTH is the only thing that matters and it's a dark, sad situation.

There's a pretty consistent inverse relationship between technical quality and popularity because time and money spent on technical/engineering resources is time and money not spent on marketing and sales resources that bring cash in the door.

Ever wonder why, with a few exceptions, it never seems that the products everyone knows about are comparable to what you can find after a little bit of research online? This is why. The people who are building good stuff are spending the time and resources on building good stuff, whereas the people who aren't are spending the time and resources on making sure they're the path of least resistance.

So in that sense, yes, you are right. It is dumb to spend any time or money on anything other than the bare minimum skeleton needed to allow your sales people to start pimping your stuff.

Whether or not people recognize your product's superiority is more or less irrelevant, because first, they won't, and because second, the extra effort it takes to swim upstream and use your product instead of the mainstream solution won't really be worth the gains for most people no matter how much better it is. You can probably rattle 15 examples of software off the top of your head that is just like this. PostgreSQL is actually a great example of it.

Amazon has run amok feeding people who don't really deserve the title "developer" a load of crap about how you can click buttons in their wizards and be like a super-real grown-up coder-hero without any having to learn any of that outdated command line mumbo-jumbo. It's 2017 after all! Don't worry about that gobbligook hocus-pocus that the smelly old man in the network closet keeps muttering under his breath. That's for smelly old people and third-party Amazon contractors. You have Very Important JavaScript to write, just as soon as you finish dragging Legos--err--"Mega-Elastic Dynamo-tastic Sumerian-Beanstalkinator Units" around on AWS.

How have other professions handled this issue? After all, most people wouldn't know the difference between a safe bridge and an unsafe one, and most people wouldn't know the difference between safely-prepared food and unsafely-prepared food (until they've already eaten it). The profit incentive is to put the bare minimum in place and then sell sell sell.

We may not like the heavy hand of regulation that will clamp down on the software industry, permanently and officially gate it behind the blood-sucking ivory tower of the academic priesthood, and strip it of all vitality and creativity, but with the attitudes that have become prevalent over the last few years, we have no one to blame but ourselves.

Your comment make me warm and fuzzy, all I can see these days is $UNNEEDED_ADDED_COMPLEXITY. People genuinely want to get their jobs done but they don't pause for the first second to analyze if that extra lib is gonna pull a bunch of other dependencies which might break the whole thing in a million different ways down the road.

Next month a new super magical JS router component-ized and gulp-ified sass-y library is gonna come out and then I'm gonna be again in that position of the "old guy in the shop who is always convincing everyone not to upgrade to the bleeding edgiest version available". And I feel it's an epidemic.

But seriously, what value does the infrastructure administration provide? If my SaaS app is built with JavaScript, why should I waste any time at all managing PostgreSQL WAL replication when I could be adding a new feature that $BigCo will pay me for?

AWS, Azure, and GCP give developers the ability to only worry about their code, and all the hard stuff like database replication, load balancing, security, secrets management, container orchestration and code deployment are handled for them. Linking lego blocks together doesn't make them a coder-hero, it just makes them more productive because the other 80% of the job is done for them, by a cloud provider that already knows how to do it the right way, at planet scale.

Like I said at the top, it's not that hosted or pre-baked solutions are bad, but it's the attitude that it's "frankly foolish" to expend any effort on it when we could blindly trust $cloudOverlord instead.

You have to know how things work at a reasonably detailed level to know whether or not $cloudOverlord's solution is appropriate or not. If you know that, then you can make an informed decision as to whether or not it's better to go with them, and the reality is that in many cases, there's no real reason to prefer $cloudOverlord's solution. It is, quite often, very expensive, not to mention more complex and entrenching oneself further into dependence on a third-party over which they have no influence or control, and whose business model is finding new ways to charge them [more] rent. It also frequently constricts the availability of patches, features, configs, and upgrades that would be available and useful in a self-administered setup.

As for planet scale, well, there was a deploy on our 100% AWS-backed infrastructure last week that went horrendously and everyone had to pull all-nighters through the weekend trying to troubleshoot the performance problems. "Planet scale" is not something automatically granted by paying through the nose for AWS. Like it or not, you have to have someone who knows what they're doing to get good performance at scale. (Our issues were caused primarily by management's refusal to accept this and preference to believe that just waving the money stick at Amazon would make all problems disappear, since AWS is a super-neato thing from a "planet scale" company.)

The issue that's become very blatant over the last few years is that you get a lot of people who assume that anything except the code they've personally written is a magical fairy box that does everything they want automatically, and then they get mad when they learn that in fact, you still have to understand the tools reasonably well just to use them properly, let alone to debug or troubleshoot issues that may be occurring within them.

Engineering time spent understanding, formulating, and composing the core building blocks of a product is likely to be more important to a project's lifecycle than the time spent writing easily-replaceable business logic in the top layer of the application.

That people are so eager to outsource these fundamental building blocks not out of simple technical expediency ("they're a better WAL wrangler than me and it will take less time for them to do it") but rather out of a sentiment that it's "stupid" to commit the time of "smart engineers" to infrastructure administration and/or design is extremely frustrating.

I understand that within the context of a startup, the VCs want the barebones version to sell ASAP so that they can "test the market" before they give their early-20s sucker-- uh, founders -- more money to burn. Some people may have extrapolated this impulse outside of any context in which it could be considered either responsible or reasonable.

This is why you should be extremely wary of anything that is only run once in a a blue moon. And very wary of such things that when run, are being run to save your bacon.

I always advocate for monthly "pull the plug on the box" tests.

If you don't need "high availability", then it will test your backups and restore process, and if you do need "high availability", it will ensure your failover processes are running smoothly.

Not to mention it trains everyone involved what to do in an emergency since it should be second nature by the time it really happens.

If you can't go "Full netflix" and unleash a chaos monkey on your servers, at least setup a maintenance period where downtime is somewhat expected, and do it then.

> This is why you should be extremely wary of anything that is only run once in a a blue moon.

I find this similar to when you launch a project that hasn't been used in production yet. Bugs should be expected because it hasn't been battle tested.

"Your backups are only as good as their last successful restore."

And wary of using options that might be outside typical usage (such as -INF for the backup VIP).

I don't see how using infinity in a pacemaker config would be outside typical usage

"The RAID controller logged the simultaneous loss of 3 disks from the array. All subsequent read and write operations against it failed."

People seem to forget that adding a RAID controller creates a single point of failure instead of removing one. :-)

> People seem to forget that adding a RAID controller creates a single point of failure instead of removing one. :-)

At worst it does both. In most cases it really does just remove a single point of failure (a disk). Other non-RAID configurations likely use a shared controller too. Moving that single point of failure to a different controller doesn't make it any worse.

I suppose it depends on which is more reliable, HDDs or the controller. If the likelihood of a HHD failure is much higher than a controller failure, then it still makes sense to go with RAID.

Always use software RAID. I'd rather loose 30% on efficiency than creating a single point of failure with a random hardware. Software RAID on modern OSes (Linux, FreeBSD, ... ) are pretty darn reliable and fast.

And if your OS breaks you can always boot from another one and recover your cluster.

Now try that with an obscure Adaptec card which has been EOL for the last 4 years :-)

And the SATA/SAS controller of your Mainboard is invincible? Or the Mainboard itself? Or the CPU? What's your backup solution if they are faulty?

I use two different controllers in JBOD mode. Each controller manages half of HDDs. Then I build cross controller software RAID 1. That is, I choose two disks from separate controllers to build raid 1. Last but not least, all HDDs are mixed from different vendors across controllers.

For CPU, RAM there is no other option. you have to replicate stuff.

Pacemaker is known to wreak havoc if it gets angry. The usual path to quick recovery when the cluster goes crazy like this is to make really sure what's the most up to date replica, shut down Pacemaker completely, assign VIP manually to a healthy replica and promote it manually. Then once you're up and back in the business figure out how to rebuild the cluster.

If this is indeed true, doesn't this negate the purpose of pacemaker to begin with? It's like anti-software. When you run with it in your environment, to recover from a failure (which seems to me what HA software should be about) you have to turn it off first or else it will destroy your recovery attempts.

It's like a perverse version of chaos-monkey, except you want it to destroy you when you are most vulnerable.

It's great when it works as expected. When it doesn't... then the fun begins. I've found it quite fragile, components versions sensitive, configuration sensitive, etc. Most of the time I've seen Pacemaker gone crazy Pg itself was happy to cooperate once the Pacemaker was out of the way. The unknown/weird Pacemaker failure modes were a real (and scary) problem.

I guess the lesson here is not to rely entirely on some HA black magic and always have procedures in place for the 'HA black magic failed us' moments. And team trained to deal with situation like this. It's only software so it will break sooner or later.

Unfortunately appropriate project name? (A maulfunctioning pacemaker can kill you)

Good write-up. I'm curious about two more things:

1. What caused the crash on the synchronous replica? Was it just a coincidence and completely unrelated to the primary failure?

2. Given the three conditions necessary for the cluster to break, was the behavior of the Pacemaker software expected? I.e., was this a gotcha that should be in the Pacemaker documentation, or a bug?

1. Unfortunately the logs don't give any detail there. Most likely something arrived down the replication connection that the process couldn't handle, and it crashed.

2. Our understanding now is that INF is the strongest preference, whereas -INF is a veto. It would be very cool to have this confirmed 100% by someone who works on Pacemaker!

The end of this post mortem was a bit handwavy TBH. I feel like they didnt dig deep enough, and the problem was the backup VIP, not the two processes crashing at once and the backup VIP.

I think by still allowing the backup VIP to run on the sync replica the same mistake is being repeated, there will always be the possibility of a situation where the VIP cannot be moved when promotion is required. That replica should be doing nothing but sitting there waiting to save the day, and if they want the backup VIP to be highly available they should provision 2 async replicas.

I too am coming up on a need for no-downtime HA failover for Postgres. I too am not allowed to use a hosted PaaS-ish solution like RDS. I was considering Citus's multi master impl (I don't need to spread the load, just need HA). I had not considered Pacemaker. Has GoCardless investigated this option and have any insight to give? HA has traditionally been a real pain point for traditional RDBMS's in my experience.

Craig from Citus here. Unfortunately, at this time Citus isn't really focused solely on solving the HA implementation for single node Postgres. Rather, we're focused on when you need to scale for performance issues. Our multi-master setup is targeting use cases that need higher throughput of 500,000+ single wrote inserts per second or say higher than 5 million writes per second when using ingestion with Postgres \copy.

To be honest we've not looked into Citus in any depth.

My early impression of it (can't speak for the rest of the team) was that it was mostly aimed at sharding analytics workloads, but parts of the docs (e.g. https://docs.citusdata.com/en/v7.1/admin_guide/cluster_manag...) make it sound like it handles OLTP workloads too.

Maybe I've been ignoring it for bad reasons!

EDIT: Managing Postgres clusters is something that a lot of people are working on. Thought I'd mention two projects that have me excited right now:

  - Patroni https://github.com/zalando/patroni
  - Stolon https://github.com/sorintlab/stolon
Stolon's client proxy approach in particular looks interesting, and reminds me of how people are using Envoy (https://github.com/envoyproxy/envoy), albeit as a TCP proxy rather than one that understands and can do fun stuff with the database's protocol. I wonder if we'll start to see more Envoy filters for different databases!

Craig from Citus here. Since we grew transactional support a couple of years ago and a number of the features we've supported since then much of our traction has come from those outgrowing single node Postgres and needing more performance. So in short we're very much focused on handling and supporting OLTP workloads.

We do also support some analytics workloads, less so data warehousing, when there is a need for end user facing analytics where higher concurrency and real-time responsiveness is key.

We’ve been using Patroni in production and it has been great. We use it with consul & pgbouncer and it can failover in under a minute with a small number of dropped requests (mostly bound by how many clients your pgbouncer can hold at once while the new master gets going). Controlled failover for upgrades or maintenance can be as quick as 10 seconds.

When deciding to go with Patroni, did you have a look at CruncyDB? We're deciding between the two and kubernetes support on CrunchyDB and documentation seems to be more comprehensive.

It looks like you are talking here about postgres-operator by Crunchy

Yeah, crunchy was a little faster with releasing it than Zalando, but try you look closer how they deploy postgres on kubernetes. Somehow it feels that they are trying to map 1 to 1 the same approach how folks used to run postgres on bare metal. That is: deploy master pod, wait until it's up and running, deploy a replica pod, and so on...

It doesn't really look cloud-k8s-native.

In my opinion such deployment should look absolutely different. You just need to deploy k8s manifest, which will create Secrets, StatefulSet and Service which will be used to connect to the master. The rest should happen automatically: * StatefulSet will start N pods with postgres * pods (Patroni) will elect leader. * elected leader will initialize (initdb) a new cluster * all other pods will get basebackup from the leader and become replicas * if the master(leader) pod die - other pods will elect a new leader * StatefulSet will start a replacement of failing pod and it will join the cluster as a new replica

And more important all it should happen without connection to Etcd, ZooKeeper or Consul. It should just use Kubernetes API.

Basically all this "magic" is already supported by Patroni https://github.com/zalando/patroni/pull/500

We are planning to merge this PR today or tomorrow, and it will make Patroni+PostgreSQL first class citizen on Kubernetes.

Thanks for the detailed reply. I see your PR has been merged. The StatefulSet support does look great. Are there any advantages over the CrunchyDB StatefulSet setup here: https://github.com/CrunchyData/crunchy-containers/tree/maste...?

Look at HAProxy, Patroni and Zookeeper

I'm told that MySQL replication blows Postgres out of the water by my company's data team, but they could just be biased since that is their area of expertise. I work on server code and don't really have much familiarity with the operations of running replica chains.

Postgres seems like a better choice for personal projects since it has a lot of nifty features. I'm also wary of Oracle, but that's my own attitude talking. For a startup eventually wanting to scale, would the better choice be to use MySQL out of the gates? Am I being mislead about Postgres clusters and availability?

Serious (naive) question; not wanting to start a flame war.

MySQL has had replication for _longer_, but I would the MySQL team has a history of releasing half-baked functionality with a long list of gotchas. PostgreSQL has only had replication in the core product for the past couple of versions, previously you had to rely on third party tools which had some interesting behaviours, or that were entirely commercial.

Since PostgreSQL 10 WAL and logical replication strategies can support just about any sort of replication you desire, except multi-master.

PostgreSQL's replication itself is perfectly fine, albeit a bit bare-bones (e.g. it's just replication plus the ability to trigger a failover). For something like cluster management and automated failovers you'll need e.g. https://repmgr.org/.

I think for a while MySQL had a much better replication story but that has changed now with logical replication now in postgresql. What most concerns me about MySql is how the configuration can shoot you in the foot. At my previous company the slave replica for a database was writable and I'm not even sure how this is a valid configuration :/ and of course someone ended up running a query on the slave and corrupting its data.

I found configuring an HA setup easier to do with MySQL. The ability to configure a master <-> master setup simply is really helpful. Of course, having writes on both sides is generally a bad idea, but it's quite simple to have master - hot standby setup, and not go through the slave promotion step.

An HA setup can be a master <-> master setup with a VIP using VRRP and a check script ( keepalived). Of course, you have to remain cautious about network partition. Another thing that might be interesting in some use cases is the ability to skip some replication errors. This is specially interesting in cases where consistency is not critical.

I actually did some setup like that with a replication ring (4 full masters), and an additional daemon re-configuring the ring dynamically when a node was down. It also monitored the transaction log, trimming them if they were about to fill up the disk and setting the GTIDs to the new values. I added some skip error to not block replication. However it was for a very simple DB (session DB containing just one table, but an SQL db was required). Basically I switched from a CA DB to an AP DB, and it's nice to be able to do these kind of things.

I know those are too simplistic setups not taking into account all the failure modes. But it also make them easier to understand and to debug.

Having ran MySQL in prod for a decade and PostgreSQL in prod for half a decade I can say without doubt that your data team is telling fibs.

Firstly we consider that there are multiple replication possibilities of both technologies- however I'm going to assume the defaults because that's pretty much what everyone uses except if there's an actual case for using something else. It's the exception.

But by default MySQL uses statement based replication (in a weird binary format with log positions and stuff) and postgresql does logical replication (as in, you transmit the binary differences of what you'll be doing to the replica's database files directly and the replica just follows along)

Both of these approaches have trade-offs depending on what you want.

Statement based replication is great if you want to have _different_ datasets on each side, You can transform the data or remove huge chunks of it on a slave and use it for a dedicated purpose. However that applies the other way, you can never really be 100% sure that your replica looks exactly like your master.

this bit me a few times with MySQL when I assumed that because the replica was 'up to date' with the master and it was set to read only, that the data had integrity- it absolutely did not.

I don't think the claim of MySQL replication being better is related to statement vs. row vs. binary diff. I think it is about the tooling and community knowledge about replication, and about running MySQL in large scale production environments in general.

MySQL is run more often at extremely large scale (Facebook, YouTube, Twitter, Dropbox, etc.) than Postgres. That results in very battle tested and/or featureful tooling like orchestrator (https://github.com/github/orchestrator), MHA (https://github.com/yoshinorim/mha4mysql-manager), ProxySQL (http://www.proxysql.com/), and gh-ost (https://github.com/github/gh-ost), along knowledge and best practices shared by those organization.

What are anybody's real-world use case for proxysql?

I have been toying with the idea of using it for local devs to access our prod DB for reads (for accurate data) and using a local DB on their machines for writes.

Not sure how to handle UPDATEs, though.

Use cases for ProxySQL: many.

- failover

- query routing (e.g., for sharded deployments)

- caching

- workload stats/metrics

- query rewriting


Having also supported hundreds of production MySQL databases, statement based replication is absolutely inferior. But it should also be noted that row based replication (similar to streaming replication in that the actual data changes are synced) has been supported in MySQL since v5.1.5 (current is v5.7). And row based replication is the default since v5.7.7 https://dev.mysql.com/doc/refman/5.7/en/replication-formats....

That's something I didn't know actually, so +1 to you.

I've been migrating away from MySQL for a number of years now.

Did you mean "physical replication"? Logical replication corresponds to MySQL's model, I think, whereas WAL streaming is just copying over the changed bytes as they get written to the WAL on the master database.

I like Postgres's physical replication for its straightforwardness. It's pretty easy to tell if your replica is up to date unless something really weird is going on. (undetected data corruption?).

That said, PostgreSQL doesn't really make replication appear easy, so I can understand people thinking that even a basic master-slave setup is difficult (In my experience its behaviour is much easier to understand than with MySQL). However, MySQL is ahead in multi-master user friendliness, and setting up eg. a simple galera cluster is pretty easy.

Whether an "easy" multi-master galera set up is actually production-quality is another matter entirely, but it is not difficult to get up and running.

> Did you mean "physical replication"? Logical replication corresponds to MySQL's model

nope postgresql supports logical replication since 10. https://www.postgresql.org/docs/10/static/logical-replicatio...

> Whether an "easy" multi-master galera set up is actually production-quality is another matter entirely, but it is not difficult to get up and running.

if you have regular network partitons (actually having network paritions is always the case, especially inside clouds) than a galera cluster can actually broke in several cases that are even worse than any failure even the most broken replication on postgresql/mysql non galera can do.

There are specialized tools like Patroni (https://github.com/zalando/patroni)

If you are running HA in AWS RDS, how would you compare your experience with the above? What are the types of RDS failures modes that you have experienced?

So far I've discovered that TCP keepalives are quite important, otherwise your queries may hang forever after failover (or at least for the default timeout which is like 30 minutes). The connection does not get broken otherwise by the failover.

Here are a couple gotchas I've seen on RDS:

- If you are running a MultiAZ instance, it is supposed to fail over automatically, but if the problem is in the networking, then you can still effectively lose service. One way around that is to run a read replica in another AZ, and use a Route53 entry with a health check to send traffic to the read replica if the primary isn't reachable. You'll still need to promote the read replica to a master though.

- If you restore from a snapshot, the new EBS volume only pulls blocks of data from S3 as they are requested. So these reads are a lot slower than normal. If you have a large database you could have degraded performance for days. Here is some more info about this: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-resto...

I am astonished that, in the two years, you had not already handled 100+ scheduled failovers. If your HA is good, customers don't notice, and if not, you find out when there are fewer of them (and in daytime!), and fix it.

Probably by now Pacemaker would have been abandoned. A hundred drills would have been enough to flush out these behaviors. If you are afraid to run drills on production equipment, you should be running them on a full-scale production testbed, ideally with mirrored production traffic. With a production-scale testbed, two years is enough to run thousands of risk-free failovers.

Not doing frequent production failure drills is just irresponsible.

I'm seeing the term "HA" being used a lot in the comments here, what does it mean?

high availability

Stolon with etcd or consul is a far superior solution for HA postgres.

Stop pretending that there's a magic bullet called "multi-master" and "transparent promotion". Your apps are super simple. Their DB interactions are super simple. Learn how to do federations and all these problems will go away.

It's curious that you decry some kinds of "magic bullet" in favor of another.

How about "there are no magic bullets, full stop"?

Interesting. Recommended reads?

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact