I’ve never in that time seen a pacemaker/corosync/etc/etc configuration go well. Ever. I have seen corrupted DBs, fail overs for no reason, etc. The worst things always happen when the failover doesn’t go according to plan and someone accidentally nukes the DB at 2am.
The lesson I’ve taken from this is it’s better to have 15-20 minutes of downtime in the unlikely event a primary goes down and run a manual failover/takeover script then it is to rely on automation. pgbouncer makes this easy enough.
That said, there was a lot of bad luck involved in this incident.
If we want to failover (nothing automatic), we stop the primary (or it's already dead) and the route is withdrawn. An operator touches the recovery file on new primary, the health checker sees that, and the IP is announced back into the network. Yes, it's a "VIP", but it's one controlled by our operations team, not automation software. One nice things about this is that you can failover across datacenters (remember it's advertised into our network over BGP) without reconfiguring DNS or messing with application servers.
While the mechanisms are different, we do something very similar for MySQL with MHA. It's still an operator running scripts intentionally though (which is what we want).
I will definitely agree with you that manual operator intervention is better than automated failover.
If you have the time and motivation to write a detailed post that would be much welcome by me and many others as well I am sure.
I've used it heartbeat/pacemaker/corosync with DRBD for over a decade and I'm pretty pleased with it...now. But it took a bit of trial by fire to get it right. Luckily I never lost any data and found the issues in testing. Which gets at the heart of any failover mechanism -- it's all useless unless you test it on a regular basis (just like your backups).
That mirrors my experience - PG, DRDB, anything else, this form of clustering is a complete joke. If it's important, it's worth spending the money on a grown-solution. Veritas has its flaws sure but in 20+ years on half a dozen different OS's it's never let me down when the balloon went up.
This post is also very relevant to the topic: http://code.openark.org/blog/mysql/mysql-high-availability-t...
The post you link is still relevant to this discussion, but we can't ignore that 5 years have passed. It would be good to see 2017's Baron revisit the topic :)
(just learning about postgres and saw an opportunity to ask someone in the know)
Note if those functions have an implementation compiled from C, you do need to install the .so on the standbys though.
The problem is not these tools, but implementing what is the right thing to do during an outage or even properly detecting one (what happened with github). Your solution might work 99 cases out of 100 but that remaining 1 case might cause your data loss.
When there is a human required to do the switch it typically he/she can investigate what happened and make the right decision.
It's theoretically possible to have a foolproof solution that always works right, but that's extremely hard to implement, because you need to know in advance what kind of issues you will have, and if you miss something, that's one case where your tool might make a wrong decision.
The problem in this topic was that they didn't understood corosync/pacemaker correctly. The syntax is akward and it's hard to configure.
With consul + patroni they would have a way better architecture that could be way more understood.
They would not need a VIP (it would work over DNS).
They used archive_command to get a WAL file from the primary on a sync replica. This should NEVER be done, if archive_command did not returned with a sane status code (which in fact it probably did not).
They did not read https://www.postgresql.org/docs/10/static/continuous-archivi... at all.
Last but not least you should never use restore_command on a sync node when it doesn't need to (always check if master is alive/healty before doing it. Maybe even check how far behind you are)
patroni would've worked in their case. patroni would've made it easy to restart the failed primary.
patroni would be in control of the postgresql which is way better than using pacemaker/corosync (especially combined with a watchdog/softdog).
what would've helped also would have been two sync nodes and fail to any of them. (will be harder since sync nodes need to be detached if unhealty)
and best thing is with etcd/consul/zk you could have a cluster of etcd/consul/zk on three different nodes than your 3 database servers (this helps a lot).
we migrated to patroni (yeah stolon is cool aswell, but since it's a little bit bigger than we need to we used patroni).
the hardest part for patroni is actually creating a script which would create service files for consul (consul is a little bit wierd when it comes to services) or somehow changes dns/haproxy whatever to point to the new master (this is not a problem on stolon)
but since then we tried all sorts of failures and never had a problem. we pulled plugs (hard drive, network, power cord) nothing bad did happen no matter what we did. watchdog worked better than expected in some cases where we tried to fire bad stuff at patroni/overload it. and since it's in python the charactaristic/memory/cpu usage is well understood. (the code is also easy to reason about, at least better than corosync/pacemaker.) etcd/zk/consul is battle tested and did work even that we have way more network partitions than your typical network (this was bad for galera.. :(:()
we never autostart a failed node after a restart/clean start. we always look into the node and manually start patroni. and also we use the role_change/etc hooks to create/delete service files in consul and to ping us if anything on the cluster happens.
It gives me automated failover, and -- perhaps more imporatantly -- the opportunity to exercise it a lot: I can reboot single servers willy-nilly, and do so regularly (for security updates every couple days).
I picked the Stolon/Patroni approach over Corosync/Pacemaker because it seems simpler and more integrated; it fully "owns" the postgres processes and controls what they do, so I suspect there is less chance to accidentally mis-configurations in the fashion of what the article describes.
I currently prefer Stolon over Patroni because statically typed languages make it easier to have less bugs (Stolon is Go, Patroni is Python), and because the proxy it brings out of the box makes it convenient: On any machine I connect to localhost:5432 to get to postgres, and if the Postgres fails over, it ensures to disconnect me so that I'm not accidentally connected to a replica.
In general, the Stolon/Patroni approach feels like the "right way" (in absence of failover being built directly into the DB, which would be great to have in upstream postgres).
Bugs. While Stolon works great most of the time, every couple months I get some weird failure. In one case it was that a stolon-keeper would refuse to come back up with an error message, in another that a failover didn't happen, in a third that Consul stopped working (I suspect a Consul bug, the create-session endpoint hung even when used via plain curl) and as a result some stale Stolon state accidentally accumulated in the Consul KV store, with entries existing that should not be there and thus Stolon refusing to start correctly.
I suspect that, as with other distributed systems that are intrinsically hard to get right, the best way to get rid of these bugs is if more people use Stolon.
Sounds like a holy-war topic :)
But lets be serious. How statically typed language helps you to avoid bugs in algorithms you implement? The rest is about proper testing.
> and because the proxy it brings out of the box makes it convenient: On any machine I connect to localhost:5432 to get to postgres
It seems like you are running a single database cluster. When you'll have to run and support hundreds of them you will change your mind.
> if the Postgres fails over, it ensures to disconnect me so that I'm not accidentally connected to a replica.
HAProxy will do absolutely the same.
> Bugs. While Stolon works great most of the time, every couple months I get some weird failure. In one case it was that a stolon-keeper would refuse to come back up with an error message, in another that a failover didn't happen, in a third that Consul stopped working (I suspect a Consul bug, the create-session endpoint hung even when used via plain curl) and as a result some stale Stolon state accidentally accumulated in the Consul KV store, with entries existing that should not be there and thus Stolon refusing to start correctly.
Yeah, it proves one more time:
* don't reinvent wheel: HAProxy vs stolon-proxy
* using statically typed language doesn't really help you to have less bugs
> I suspect that, as with other distributed systems that are intrinsically hard to get right, the best way to get rid of these bugs is if more people use Stolon.
As I've already told before. We are running a few hundred Patroni clusters with etcd and a few dozen with ZooKeeper. Never had such strange problems.
> HAProxy will do absolutely the same.
well I think that is not the same what stolon-proxy actually provides.
(actually I use patroni)
but if your network gets split and you end up with two masters (one application writes to the old master) there would be a problem if one application would still be connected to the splitted master.
however I do not get the point, because etcd / consul would not allow to still hold the master role which means that the splitted master would lose the master role and thus either die, because it can not connect to the new master or just be a read slave and the application would than probably throw errors if users are still connected to the splitted application.
highly depends how big your etcd/consul is and how good your application detects failures.
(since we are highly dependent on our database we actually kill hikaricp (java) in case of too many master write failures and just restart it after a special amount of time.
well we also look in creating a small lightweight async driver based on akka, where we do this in a little bit more automated fashion.)
On network partition Patroni will not be able to update leader key in Etcd and therefore restart postgres in read-only mode (create recovery.conf and restart). No writes will be possible.
you also have to be careful with ensuring there is sufficient time gaps when failing over to cover the case when the master is not really down and connections are still writing to it. like the patroni default haproxy config doesn't even seem to kill live connections which seems kind of risky.
If patroni can't update leader lock in Etcd, it will restart postgres in read-only mode. No writes will happen.
> like the patroni default haproxy config doesn't even seem to kill live connections which seems kind of risky.
That's not true: https://github.com/zalando/patroni/blob/master/haproxy.cfg#L...
this is the coolest part of this story. Any chance these scripts are opensource ?
Our dev team also came up with some pertinent questions, which we have put to both companies, but if anyone else can comment from experience that would be fantastic:
* Is the product a fork of PostgreSQL or a wrapper round the current version?
* Will the DB engine keep in lock-step with new PostgreSQL releases or might they diverge?
* If the DB engine keeps in lock-step, what’s the period between a new version of PostgreSQL being released before its incorporated in the live product?
* When new versions of Amazon Aurora/Azure DB for PostgreSQL are released will our live instance get automatically updated or will we be able to choose a version?
> Is the product a fork of PostgreSQL or a wrapper round the current version?
Aurora is a fork: they've re-written a significant chunk of the engine. Note that Amazon also offers RDS PostgreSQL, which is a managed version of the "regular" PostgreSQL engine. RDS PostgreSQL also offers a HA setup (no version upgrade without downtime, however). It works quite well.
> Will the DB engine keep in lock-step with new PostgreSQL releases or might they diverge?
Amazon promises to keep it in lock-step. How soon they will release an upgrade to a major version remains to be seen.
> When new versions of Amazon Aurora/Azure DB for PostgreSQL are released will our live instance get automatically updated or will we be able to choose a version?
Minor version upgrades are applied automatically. For major version upgrades, it's unclear at this time (there hasn't been one yet for Aurora PostgreSQL), but I think it's unlikely they will be applied automatically.
They should either provide one or "bless" an external solution as the official one (after making sure it works correctly).
The other problem is that GoCardless setup an asynchronous and a synchronous replica instead of 2 synchronous replicas (or preferably 4+), resulting in only two points of failure, which is not enough.
Source: PostgreSQL DBA for over a decade; have built multiple HA environments; have seen many ways of "doing it wrong", and how those can end up biting their creators.
With hosted Postgres, when a failure does happen, isn't it much harder to get at the log files? They seem extremely useful to diagnose the problem and make sure it doesn't happen again, as the article shows. What's your experiene here, can you get at logs easily with hosted Posgres offerings?
And it seems the only way to get reliable Postgres HA for everyone, and to weed out the bugs, is if more people run Posgres HA themselves. For example, I find Stolon and Patroni great, but I would be more relaxed about them if they had 100x more users.
None of us trust any of the automated failover solutions enough to use them. We want human judgement in that loop, even if it means being woken at 3AM to push the button. It's that hard to get right.
Just one incident like The Fine Article's is well more than our entire infrastructure's total downtime for the rolling year, and we have hundreds of postgres instances.
Done wrong, automated failover is a net increase in risk. And, in case my thesis is somehow unclear, it's hard to get right.
OF COURSE every thing is going to work in the lab, BUT MAY BE there is some other corner case in the production that you haven't considered yet. --- Louis C.K. after a sleepless night of switching primary DB to secondaries.
But, when you get to the point where you need multi-master replication, you're making a mistake if you aren't dedicating an ops person (i.e. a DBA) to managing your growing cluster. If you can't afford that ops person, much better to just pay a DBaaS provider to handle the ops for you, than to get hosed when your cluster falls apart and your week gets shot putting it back together.
A thing that scares me is anyone saying they're running their own HA cluster (not single instance) for cost reasons. Infra people are not cheaper than the hosted solutions (Amazon RDS, Google Cloud SQL, Heroku Postgres).
My claim is that you need to hire some expensive people if you want that performance, not that there aren't reasons to run your own database instances!
How many startups have failed because they spent too much money building "cool, nerdy, infrastructure" instead of just building a product?
There's nothing necessarily wrong with using pre-baked or hosted components when they fit the bill, but pretending like they're unrelated concerns is going down a bad road. A lot of recent fads are based on this self-centered, lazy fantasy from devs that $LANGUAGE_OF_THE_MONTH is the only thing that matters and it's a dark, sad situation.
There's a pretty consistent inverse relationship between technical quality and popularity because time and money spent on technical/engineering resources is time and money not spent on marketing and sales resources that bring cash in the door.
Ever wonder why, with a few exceptions, it never seems that the products everyone knows about are comparable to what you can find after a little bit of research online? This is why. The people who are building good stuff are spending the time and resources on building good stuff, whereas the people who aren't are spending the time and resources on making sure they're the path of least resistance.
So in that sense, yes, you are right. It is dumb to spend any time or money on anything other than the bare minimum skeleton needed to allow your sales people to start pimping your stuff.
Whether or not people recognize your product's superiority is more or less irrelevant, because first, they won't, and because second, the extra effort it takes to swim upstream and use your product instead of the mainstream solution won't really be worth the gains for most people no matter how much better it is. You can probably rattle 15 examples of software off the top of your head that is just like this. PostgreSQL is actually a great example of it.
How have other professions handled this issue? After all, most people wouldn't know the difference between a safe bridge and an unsafe one, and most people wouldn't know the difference between safely-prepared food and unsafely-prepared food (until they've already eaten it). The profit incentive is to put the bare minimum in place and then sell sell sell.
We may not like the heavy hand of regulation that will clamp down on the software industry, permanently and officially gate it behind the blood-sucking ivory tower of the academic priesthood, and strip it of all vitality and creativity, but with the attitudes that have become prevalent over the last few years, we have no one to blame but ourselves.
Next month a new super magical JS router component-ized and gulp-ified sass-y library is gonna come out and then I'm gonna be again in that position of the "old guy in the shop who is always convincing everyone not to upgrade to the bleeding edgiest version available". And I feel it's an epidemic.
AWS, Azure, and GCP give developers the ability to only worry about their code, and all the hard stuff like database replication, load balancing, security, secrets management, container orchestration and code deployment are handled for them. Linking lego blocks together doesn't make them a coder-hero, it just makes them more productive because the other 80% of the job is done for them, by a cloud provider that already knows how to do it the right way, at planet scale.
You have to know how things work at a reasonably detailed level to know whether or not $cloudOverlord's solution is appropriate or not. If you know that, then you can make an informed decision as to whether or not it's better to go with them, and the reality is that in many cases, there's no real reason to prefer $cloudOverlord's solution. It is, quite often, very expensive, not to mention more complex and entrenching oneself further into dependence on a third-party over which they have no influence or control, and whose business model is finding new ways to charge them [more] rent. It also frequently constricts the availability of patches, features, configs, and upgrades that would be available and useful in a self-administered setup.
As for planet scale, well, there was a deploy on our 100% AWS-backed infrastructure last week that went horrendously and everyone had to pull all-nighters through the weekend trying to troubleshoot the performance problems. "Planet scale" is not something automatically granted by paying through the nose for AWS. Like it or not, you have to have someone who knows what they're doing to get good performance at scale. (Our issues were caused primarily by management's refusal to accept this and preference to believe that just waving the money stick at Amazon would make all problems disappear, since AWS is a super-neato thing from a "planet scale" company.)
The issue that's become very blatant over the last few years is that you get a lot of people who assume that anything except the code they've personally written is a magical fairy box that does everything they want automatically, and then they get mad when they learn that in fact, you still have to understand the tools reasonably well just to use them properly, let alone to debug or troubleshoot issues that may be occurring within them.
Engineering time spent understanding, formulating, and composing the core building blocks of a product is likely to be more important to a project's lifecycle than the time spent writing easily-replaceable business logic in the top layer of the application.
That people are so eager to outsource these fundamental building blocks not out of simple technical expediency ("they're a better WAL wrangler than me and it will take less time for them to do it") but rather out of a sentiment that it's "stupid" to commit the time of "smart engineers" to infrastructure administration and/or design is extremely frustrating.
I understand that within the context of a startup, the VCs want the barebones version to sell ASAP so that they can "test the market" before they give their early-20s sucker-- uh, founders -- more money to burn. Some people may have extrapolated this impulse outside of any context in which it could be considered either responsible or reasonable.
If you don't need "high availability", then it will test your backups and restore process, and if you do need "high availability", it will ensure your failover processes are running smoothly.
Not to mention it trains everyone involved what to do in an emergency since it should be second nature by the time it really happens.
If you can't go "Full netflix" and unleash a chaos monkey on your servers, at least setup a maintenance period where downtime is somewhat expected, and do it then.
I find this similar to when you launch a project that hasn't been used in production yet. Bugs should be expected because it hasn't been battle tested.
People seem to forget that adding a RAID controller creates a single point of failure instead of removing one. :-)
At worst it does both. In most cases it really does just remove a single point of failure (a disk). Other non-RAID configurations likely use a shared controller too. Moving that single point of failure to a different controller doesn't make it any worse.
Now try that with an obscure Adaptec card which has been EOL for the last 4 years :-)
For CPU, RAM there is no other option. you have to replicate stuff.
It's like a perverse version of chaos-monkey, except you want it to destroy you when you are most vulnerable.
I guess the lesson here is not to rely entirely on some HA black magic and always have procedures in place for the 'HA black magic failed us' moments. And team trained to deal with situation like this. It's only software so it will break sooner or later.
1. What caused the crash on the synchronous replica? Was it just a coincidence and completely unrelated to the primary failure?
2. Given the three conditions necessary for the cluster to break, was the behavior of the Pacemaker software expected? I.e., was this a gotcha that should be in the Pacemaker documentation, or a bug?
2. Our understanding now is that INF is the strongest preference, whereas -INF is a veto. It would be very cool to have this confirmed 100% by someone who works on Pacemaker!
I think by still allowing the backup VIP to run on the sync replica the same mistake is being repeated, there will always be the possibility of a situation where the VIP cannot be moved when promotion is required. That replica should be doing nothing but sitting there waiting to save the day, and if they want the backup VIP to be highly available they should provision 2 async replicas.
My early impression of it (can't speak for the rest of the team) was that it was mostly aimed at sharding analytics workloads, but parts of the docs (e.g. https://docs.citusdata.com/en/v7.1/admin_guide/cluster_manag...) make it sound like it handles OLTP workloads too.
Maybe I've been ignoring it for bad reasons!
EDIT: Managing Postgres clusters is something that a lot of people are working on. Thought I'd mention two projects that have me excited right now:
- Patroni https://github.com/zalando/patroni
- Stolon https://github.com/sorintlab/stolon
We do also support some analytics workloads, less so data warehousing, when there is a need for end user facing analytics where higher concurrency and real-time responsiveness is key.
Yeah, crunchy was a little faster with releasing it than Zalando, but try you look closer how they deploy postgres on kubernetes.
Somehow it feels that they are trying to map 1 to 1 the same approach how folks used to run postgres on bare metal. That is: deploy master pod, wait until it's up and running, deploy a replica pod, and so on...
It doesn't really look cloud-k8s-native.
In my opinion such deployment should look absolutely different. You just need to deploy k8s manifest, which will create Secrets, StatefulSet and Service which will be used to connect to the master.
The rest should happen automatically:
* StatefulSet will start N pods with postgres
* pods (Patroni) will elect leader.
* elected leader will initialize (initdb) a new cluster
* all other pods will get basebackup from the leader and become replicas
* if the master(leader) pod die - other pods will elect a new leader
* StatefulSet will start a replacement of failing pod and it will join the cluster as a new replica
And more important all it should happen without connection to Etcd, ZooKeeper or Consul. It should just use Kubernetes API.
Basically all this "magic" is already supported by Patroni https://github.com/zalando/patroni/pull/500
We are planning to merge this PR today or tomorrow, and it will make Patroni+PostgreSQL first class citizen on Kubernetes.
Postgres seems like a better choice for personal projects since it has a lot of nifty features. I'm also wary of Oracle, but that's my own attitude talking. For a startup eventually wanting to scale, would the better choice be to use MySQL out of the gates? Am I being mislead about Postgres clusters and availability?
Serious (naive) question; not wanting to start a flame war.
Since PostgreSQL 10 WAL and logical replication strategies can support just about any sort of replication you desire, except multi-master.
An HA setup can be a master <-> master setup with a VIP using VRRP and a check script ( keepalived). Of course, you have to remain cautious about network partition.
Another thing that might be interesting in some use cases is the ability to skip some replication errors. This is specially interesting in cases where consistency is not critical.
I actually did some setup like that with a replication ring (4 full masters), and an additional daemon re-configuring the ring dynamically when a node was down. It also monitored the transaction log, trimming them if they were about to fill up the disk and setting the GTIDs to the new values. I added some skip error to not block replication. However it was for a very simple DB (session DB containing just one table, but an SQL db was required). Basically I switched from a CA DB to an AP DB, and it's nice to be able to do these kind of things.
I know those are too simplistic setups not taking into account all the failure modes. But it also make them easier to understand and to debug.
Firstly we consider that there are multiple replication possibilities of both technologies- however I'm going to assume the defaults because that's pretty much what everyone uses except if there's an actual case for using something else. It's the exception.
But by default MySQL uses statement based replication (in a weird binary format with log positions and stuff) and postgresql does logical replication (as in, you transmit the binary differences of what you'll be doing to the replica's database files directly and the replica just follows along)
Both of these approaches have trade-offs depending on what you want.
Statement based replication is great if you want to have _different_ datasets on each side, You can transform the data or remove huge chunks of it on a slave and use it for a dedicated purpose. However that applies the other way, you can never really be 100% sure that your replica looks exactly like your master.
this bit me a few times with MySQL when I assumed that because the replica was 'up to date' with the master and it was set to read only, that the data had integrity- it absolutely did not.
MySQL is run more often at extremely large scale (Facebook, YouTube, Twitter, Dropbox, etc.) than Postgres. That results in very battle tested and/or featureful tooling like orchestrator (https://github.com/github/orchestrator), MHA (https://github.com/yoshinorim/mha4mysql-manager), ProxySQL (http://www.proxysql.com/), and gh-ost (https://github.com/github/gh-ost), along knowledge and best practices shared by those organization.
I have been toying with the idea of using it for local devs to access our prod DB for reads (for accurate data) and using a local DB on their machines for writes.
Not sure how to handle UPDATEs, though.
- query routing (e.g., for sharded deployments)
- workload stats/metrics
- query rewriting
I've been migrating away from MySQL for a number of years now.
I like Postgres's physical replication for its straightforwardness. It's pretty easy to tell if your replica is up to date unless something really weird is going on. (undetected data corruption?).
That said, PostgreSQL doesn't really make replication appear easy, so I can understand people thinking that even a basic master-slave setup is difficult (In my experience its behaviour is much easier to understand than with MySQL). However, MySQL is ahead in multi-master user friendliness, and setting up eg. a simple galera cluster is pretty easy.
Whether an "easy" multi-master galera set up is actually production-quality is another matter entirely, but it is not difficult to get up and running.
nope postgresql supports logical replication since 10. https://www.postgresql.org/docs/10/static/logical-replicatio...
> Whether an "easy" multi-master galera set up is actually production-quality is another matter entirely, but it is not difficult to get up and running.
if you have regular network partitons (actually having network paritions is always the case, especially inside clouds) than a galera cluster can actually broke in several cases that are even worse than any failure even the most broken replication on postgresql/mysql non galera can do.
So far I've discovered that TCP keepalives are quite important, otherwise your queries may hang forever after failover (or at least for the default timeout which is like 30 minutes). The connection does not get broken otherwise by the failover.
- If you are running a MultiAZ instance, it is supposed to fail over automatically, but if the problem is in the networking, then you can still effectively lose service. One way around that is to run a read replica in another AZ, and use a Route53 entry with a health check to send traffic to the read replica if the primary isn't reachable. You'll still need to promote the read replica to a master though.
- If you restore from a snapshot, the new EBS volume only pulls blocks of data from S3 as they are requested. So these reads are a lot slower than normal. If you have a large database you could have degraded performance for days. Here is some more info about this: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-resto...
Probably by now Pacemaker would have been abandoned. A
hundred drills would have been enough to flush out these behaviors. If you are afraid to run drills on production equipment, you should be running them on a full-scale production testbed, ideally with mirrored production traffic. With a production-scale testbed, two years is enough to run thousands of risk-free failovers.
Not doing frequent production failure drills is just irresponsible.
How about "there are no magic bullets, full stop"?