I’ve never in that time seen a pacemaker/corosync/etc/etc configuration go well. Ever. I have seen corrupted DBs, fail overs for no reason, etc. The worst things always happen when the failover doesn’t go according to plan and someone accidentally nukes the DB at 2am.
The lesson I’ve taken from this is it’s better to have 15-20 minutes of downtime in the unlikely event a primary goes down and run a manual failover/takeover script then it is to rely on automation. pgbouncer makes this easy enough.
That said, there was a lot of bad luck involved in this incident.
If we want to failover (nothing automatic), we stop the primary (or it's already dead) and the route is withdrawn. An operator touches the recovery file on new primary, the health checker sees that, and the IP is announced back into the network. Yes, it's a "VIP", but it's one controlled by our operations team, not automation software. One nice things about this is that you can failover across datacenters (remember it's advertised into our network over BGP) without reconfiguring DNS or messing with application servers.
While the mechanisms are different, we do something very similar for MySQL with MHA. It's still an operator running scripts intentionally though (which is what we want).
I will definitely agree with you that manual operator intervention is better than automated failover.
If you have the time and motivation to write a detailed post that would be much welcome by me and many others as well I am sure.
I've used it heartbeat/pacemaker/corosync with DRBD for over a decade and I'm pretty pleased with it...now. But it took a bit of trial by fire to get it right. Luckily I never lost any data and found the issues in testing. Which gets at the heart of any failover mechanism -- it's all useless unless you test it on a regular basis (just like your backups).
That mirrors my experience - PG, DRDB, anything else, this form of clustering is a complete joke. If it's important, it's worth spending the money on a grown-solution. Veritas has its flaws sure but in 20+ years on half a dozen different OS's it's never let me down when the balloon went up.
This post is also very relevant to the topic: http://code.openark.org/blog/mysql/mysql-high-availability-t...
The post you link is still relevant to this discussion, but we can't ignore that 5 years have passed. It would be good to see 2017's Baron revisit the topic :)
(just learning about postgres and saw an opportunity to ask someone in the know)
Note if those functions have an implementation compiled from C, you do need to install the .so on the standbys though.
The problem is not these tools, but implementing what is the right thing to do during an outage or even properly detecting one (what happened with github). Your solution might work 99 cases out of 100 but that remaining 1 case might cause your data loss.
When there is a human required to do the switch it typically he/she can investigate what happened and make the right decision.
It's theoretically possible to have a foolproof solution that always works right, but that's extremely hard to implement, because you need to know in advance what kind of issues you will have, and if you miss something, that's one case where your tool might make a wrong decision.
The problem in this topic was that they didn't understood corosync/pacemaker correctly. The syntax is akward and it's hard to configure.
With consul + patroni they would have a way better architecture that could be way more understood.
They would not need a VIP (it would work over DNS).
They used archive_command to get a WAL file from the primary on a sync replica. This should NEVER be done, if archive_command did not returned with a sane status code (which in fact it probably did not).
They did not read https://www.postgresql.org/docs/10/static/continuous-archivi... at all.
Last but not least you should never use restore_command on a sync node when it doesn't need to (always check if master is alive/healty before doing it. Maybe even check how far behind you are)
patroni would've worked in their case. patroni would've made it easy to restart the failed primary.
patroni would be in control of the postgresql which is way better than using pacemaker/corosync (especially combined with a watchdog/softdog).
what would've helped also would have been two sync nodes and fail to any of them. (will be harder since sync nodes need to be detached if unhealty)
and best thing is with etcd/consul/zk you could have a cluster of etcd/consul/zk on three different nodes than your 3 database servers (this helps a lot).
we migrated to patroni (yeah stolon is cool aswell, but since it's a little bit bigger than we need to we used patroni).
the hardest part for patroni is actually creating a script which would create service files for consul (consul is a little bit wierd when it comes to services) or somehow changes dns/haproxy whatever to point to the new master (this is not a problem on stolon)
but since then we tried all sorts of failures and never had a problem. we pulled plugs (hard drive, network, power cord) nothing bad did happen no matter what we did. watchdog worked better than expected in some cases where we tried to fire bad stuff at patroni/overload it. and since it's in python the charactaristic/memory/cpu usage is well understood. (the code is also easy to reason about, at least better than corosync/pacemaker.) etcd/zk/consul is battle tested and did work even that we have way more network partitions than your typical network (this was bad for galera.. :(:()
we never autostart a failed node after a restart/clean start. we always look into the node and manually start patroni. and also we use the role_change/etc hooks to create/delete service files in consul and to ping us if anything on the cluster happens.
It gives me automated failover, and -- perhaps more imporatantly -- the opportunity to exercise it a lot: I can reboot single servers willy-nilly, and do so regularly (for security updates every couple days).
I picked the Stolon/Patroni approach over Corosync/Pacemaker because it seems simpler and more integrated; it fully "owns" the postgres processes and controls what they do, so I suspect there is less chance to accidentally mis-configurations in the fashion of what the article describes.
I currently prefer Stolon over Patroni because statically typed languages make it easier to have less bugs (Stolon is Go, Patroni is Python), and because the proxy it brings out of the box makes it convenient: On any machine I connect to localhost:5432 to get to postgres, and if the Postgres fails over, it ensures to disconnect me so that I'm not accidentally connected to a replica.
In general, the Stolon/Patroni approach feels like the "right way" (in absence of failover being built directly into the DB, which would be great to have in upstream postgres).
Bugs. While Stolon works great most of the time, every couple months I get some weird failure. In one case it was that a stolon-keeper would refuse to come back up with an error message, in another that a failover didn't happen, in a third that Consul stopped working (I suspect a Consul bug, the create-session endpoint hung even when used via plain curl) and as a result some stale Stolon state accidentally accumulated in the Consul KV store, with entries existing that should not be there and thus Stolon refusing to start correctly.
I suspect that, as with other distributed systems that are intrinsically hard to get right, the best way to get rid of these bugs is if more people use Stolon.
Sounds like a holy-war topic :)
But lets be serious. How statically typed language helps you to avoid bugs in algorithms you implement? The rest is about proper testing.
> and because the proxy it brings out of the box makes it convenient: On any machine I connect to localhost:5432 to get to postgres
It seems like you are running a single database cluster. When you'll have to run and support hundreds of them you will change your mind.
> if the Postgres fails over, it ensures to disconnect me so that I'm not accidentally connected to a replica.
HAProxy will do absolutely the same.
> Bugs. While Stolon works great most of the time, every couple months I get some weird failure. In one case it was that a stolon-keeper would refuse to come back up with an error message, in another that a failover didn't happen, in a third that Consul stopped working (I suspect a Consul bug, the create-session endpoint hung even when used via plain curl) and as a result some stale Stolon state accidentally accumulated in the Consul KV store, with entries existing that should not be there and thus Stolon refusing to start correctly.
Yeah, it proves one more time:
* don't reinvent wheel: HAProxy vs stolon-proxy
* using statically typed language doesn't really help you to have less bugs
> I suspect that, as with other distributed systems that are intrinsically hard to get right, the best way to get rid of these bugs is if more people use Stolon.
As I've already told before. We are running a few hundred Patroni clusters with etcd and a few dozen with ZooKeeper. Never had such strange problems.
> HAProxy will do absolutely the same.
well I think that is not the same what stolon-proxy actually provides.
(actually I use patroni)
but if your network gets split and you end up with two masters (one application writes to the old master) there would be a problem if one application would still be connected to the splitted master.
however I do not get the point, because etcd / consul would not allow to still hold the master role which means that the splitted master would lose the master role and thus either die, because it can not connect to the new master or just be a read slave and the application would than probably throw errors if users are still connected to the splitted application.
highly depends how big your etcd/consul is and how good your application detects failures.
(since we are highly dependent on our database we actually kill hikaricp (java) in case of too many master write failures and just restart it after a special amount of time.
well we also look in creating a small lightweight async driver based on akka, where we do this in a little bit more automated fashion.)
On network partition Patroni will not be able to update leader key in Etcd and therefore restart postgres in read-only mode (create recovery.conf and restart). No writes will be possible.
you also have to be careful with ensuring there is sufficient time gaps when failing over to cover the case when the master is not really down and connections are still writing to it. like the patroni default haproxy config doesn't even seem to kill live connections which seems kind of risky.
If patroni can't update leader lock in Etcd, it will restart postgres in read-only mode. No writes will happen.
> like the patroni default haproxy config doesn't even seem to kill live connections which seems kind of risky.
That's not true: https://github.com/zalando/patroni/blob/master/haproxy.cfg#L...