
PostgreSQL HA cluster failure: a post-mortem - stevedomin
https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/
======
foobarbazetc
I’ve been involved in various Postgres roles (developer on it, consultant for
it for BigCo, using it at my startup now) for around 18 years.

I’ve never in that time seen a pacemaker/corosync/etc configuration go
well. Ever. I have seen corrupted DBs, failovers for no reason, etc. The
worst things always happen when the failover doesn’t go according to plan and
someone accidentally nukes the DB at 2am.

The lesson I’ve taken from this is that it’s better to have 15-20 minutes of
downtime in the unlikely event a primary goes down, and to run a manual
failover/takeover script, than it is to rely on automation. pgbouncer makes
this easy enough.
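
Roughly, the manual path looks like this (DB name, hosts, and paths are made
up; pgbouncer's admin console really does support PAUSE/RESUME):

    # Hold client queries at the pooler instead of erroring them out:
    psql -p 6432 -U pgbouncer pgbouncer -c "PAUSE mydb;"

    # Promote the chosen standby by hand:
    ssh standby1 'pg_ctl promote -D /var/lib/postgresql/data'

    # Repoint the [databases] entry for mydb at standby1 in
    # pgbouncer.ini, then pick the queued clients back up:
    psql -p 6432 -U pgbouncer pgbouncer -c "RELOAD;"
    psql -p 6432 -U pgbouncer pgbouncer -c "RESUME mydb;"

Clients connected through pgbouncer see a pause rather than connection
errors, which is what makes a short, human-driven failover tolerable.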

That said, there was a lot of bad luck involved in this incident.

~~~
morrbo
How does pgbouncer make this process easy? Just because there are fewer
connections to go to the final DB? I've also got a random question you might
be able to answer (having a hard time googling)... when using pgsql in
streaming replication mode, are created functions replicated/updated to
everything else as well?

(just learning about postgres and saw an opportunity to ask someone in the
know) Cheers!

~~~
pjungwir
> are created functions replicated/updated to everything else as well?

Yes.

Note that if those functions are implemented in C, you do need to install
the .so on the standbys too.
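
For example (function names and paths are made up): a SQL function is just a
catalog change that rides along in the WAL stream, while a C function only
stores a reference to its shared object:

    # Replicates to standbys automatically; it's just catalog data:
    psql -d app -c 'CREATE FUNCTION add_tax(n numeric) RETURNS numeric LANGUAGE sql AS $$ SELECT n * 1.2 $$;'

    # Only the *reference* to the .so replicates; the file itself
    # must be installed on each standby by hand:
    psql -d app -c "CREATE FUNCTION fast_hash(text) RETURNS bigint AS '/usr/lib/postgresql/lib/myhash', 'fast_hash' LANGUAGE c STRICT;"
    scp /usr/lib/postgresql/lib/myhash.so standby1:/usr/lib/postgresql/lib/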

~~~
koffiezet
You need the .so's on the replicas simply for the database structure, but
afaik it's only the on-disk format that's being synchronized; functions are
not executed on the replicas. I'm not even sure this would throw errors if
the .so's were missing, but it will cause problems when a replica is promoted
to the new master.

~~~
pjungwir
I agree: it's only because you need them when you fail over.

------
sandGorgon
> _Fortunately, as part of some unrelated work we'd done recently, we had a
> version of the cluster that we could run inside Docker containers. We used
> it to help us build a script that mimicked the failures we saw in
> production. Being able to rapidly turn clusters up and down let us iterate
> on that script quickly, until we found a combination of events that broke
> the cluster in just the right way._

This is the coolest part of this story. Any chance these scripts are open
source?

~~~
Sinjo
We plan to open source it as soon as we can. Tiny bit more work to do, then
review from a couple of people in the team, then we make it public. :)

~~~
sandGorgon
Thanks for this !

------
linker3000
Since I am investigating HA with PostgreSQL right now and have bitter
experience of Pacemaker 'HA' instances that have been anything but, I am
looking at Amazon Aurora and Microsoft's (in preview) Azure database for
PostgreSQL offerings. I would really appreciate any insight from others who
are already using them (we intend to do some PoC work shortly).

Our dev team also came up with some pertinent questions, which we have put to
both companies, but if anyone else can comment from experience that would be
fantastic:

* Is the product a fork of PostgreSQL or a wrapper around the current version?

* Will the DB engine keep in lock-step with new PostgreSQL releases or might they diverge?

* If the DB engine keeps in lock-step, what’s the period between a new version of PostgreSQL being released and its incorporation into the live product?

* When new versions of Amazon Aurora/Azure DB for PostgreSQL are released will our live instance get automatically updated or will we be able to choose a version?

~~~
lattepiu
I can comment on Amazon Aurora:

> Is the product a fork of PostgreSQL or a wrapper around the current version?

Aurora is a fork: they've re-written a significant chunk of the engine. Note
that Amazon also offers RDS PostgreSQL, which is a managed version of the
"regular" PostgreSQL engine. RDS PostgreSQL also offers a HA setup (no version
upgrade without downtime, however). It works quite well.

> Will the DB engine keep in lock-step with new PostgreSQL releases or might
> they diverge?

Amazon promises to keep it in lock-step. How soon they will release an upgrade
to a major version remains to be seen.

> When new versions of Amazon Aurora/Azure DB for PostgreSQL are released will
> our live instance get automatically updated or will we be able to choose a
> version?

Minor version upgrades are applied automatically. For major version upgrades,
it's unclear at this time (there hasn't been one yet for Aurora PostgreSQL),
but I think it's unlikely they will be applied automatically.
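
For what it's worth, the minor-version behaviour is controllable per
instance via the AWS CLI (the instance name here is made up):

    # Opt out of automatic minor version upgrades:
    aws rds modify-db-instance \
        --db-instance-identifier mydb-prod \
        --no-auto-minor-version-upgrade \
        --apply-immediately

    # Check the current setting:
    aws rds describe-db-instances \
        --db-instance-identifier mydb-prod \
        --query 'DBInstances[0].AutoMinorVersionUpgrade'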

------
devit
I think the root issue is that PostgreSQL does not offer an HA solution that
works out of the box with minimal configuration, resulting in people using
broken third-party ones and/or configuring them incorrectly.

They should either provide one or "bless" an external solution as the official
one (after making sure it works correctly).

The other problem is that GoCardless set up one asynchronous and one
synchronous replica instead of 2 synchronous replicas (or preferably 4+),
leaving only two nodes holding up-to-date data, which is not enough.
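
For the record, requiring multiple synchronous standbys is a one-line
setting on PostgreSQL 9.6+ (the standby names here are made up):

    # Commits wait for acknowledgement from 2 of the named standbys:
    psql -c "ALTER SYSTEM SET synchronous_standby_names = '2 (replica_a, replica_b, replica_c)';"
    psql -c 'SELECT pg_reload_conf();'

Note the flip side: if fewer standbys than required are reachable, commits
block entirely, which is exactly why you want more standbys than the quorum
you ask for.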

~~~
Sinjo
Pretty much number 1 on my Postgres wishlist.

------
tomc1985
Makes me sad that running your own instances is now an "elephant in the room."
No pride in old-school do-it-yourself nerditry these days :/

~~~
rosser
Getting HA right is _hard_. DIY-ing it incurs real risk, often taken on out
of Not-Invented-Here-ism.

Source: PostgreSQL DBA for over a decade; have built multiple HA environments;
have seen many ways of "doing it wrong", and how those can end up biting their
creators.

~~~
nh2
On the other hand:

With hosted Postgres, when a failure does happen, isn't it much harder to
get at the log files? They seem extremely useful for diagnosing the problem
and making sure it doesn't happen again, as the article shows. What's your
experience here: can you get at logs easily with hosted Postgres offerings?

And it seems the only way to get reliable Postgres HA for everyone, and to
weed out the bugs, is for more people to run Postgres HA themselves. For
example, I find Stolon and Patroni great, but I would be more relaxed about
them if they had 100x more users.

~~~
rosser
We aren't using hosted postgres (much, yet). We provision EC2 instances and
self-manage it. Failover is scripted, and manually invoked as needed.

None of us trust any of the automated failover solutions enough to use them.
We want human judgement in that loop, even if it means being woken at 3AM to
push the button. It's _that_ hard to get right.

Just one incident like The Fine Article's is well more than our entire
infrastructure's total downtime for the rolling year, and we have hundreds of
postgres instances.

Done wrong, automated failover is a net _increase_ in risk. And, in case my
thesis is somehow unclear, _it's hard to get right_.

------
dboreham
This is why you should be extremely wary of anything that is only run once
in a blue moon. And very wary of such things that, when run, are being run to
save your bacon.

~~~
rosser
"Your backups are only as good as their last successful restore."

~~~
CodeWriter23
[https://feedbin.com/blog/2017/05/05/testing-huge-postgresql-backups-for-37-cents-a-day/](https://feedbin.com/blog/2017/05/05/testing-huge-postgresql-backups-for-37-cents-a-day/)

------
ahoka
"The RAID controller logged the simultaneous loss of 3 disks from the array.
All subsequent read and write operations against it failed."

People seem to forget that adding a RAID controller creates a single point of
failure instead of removing one. :-)

~~~
majidazimi
Always use software RAID. I'd rather lose 30% in efficiency than create a
single point of failure with random hardware. Software RAID on modern OSes
(Linux, FreeBSD, ...) is pretty darn reliable and fast.

~~~
king_phil
And the SATA/SAS controller on your mainboard is invincible? Or the
mainboard itself? Or the CPU? What's your backup solution if they are faulty?

~~~
majidazimi
I use two different controllers in JBOD mode. Each controller manages half
of the HDDs. Then I build cross-controller software RAID 1: I choose two
disks from separate controllers to build each RAID 1 pair. Last but not
least, the HDDs are mixed from different vendors across controllers.

For CPU and RAM there is no other option: you have to replicate.
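
In mdadm terms, something like this (device names are made up; say sda/sdb
hang off controller A and sdc/sdd off controller B):

    # Each mirror pairs one disk from each controller, so losing a
    # whole controller still leaves every array with one good leg:
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdc1
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdd1

    # Verify both legs of each mirror are up:
    cat /proc/mdstat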

------
jskrablin
Pacemaker is known to wreak havoc when it gets angry. The usual path to
quick recovery when the cluster goes crazy like this is to make really sure
which replica is the most up to date, shut down Pacemaker completely, assign
the VIP manually to a healthy replica, and promote it manually. Then, once
you're up and back in business, figure out how to rebuild the cluster.
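
Something like this, assuming a pre-10 Postgres (hostnames, VIP, and paths
are made up; on 10+ the function is pg_last_wal_receive_lsn()):

    # 1. Find the replica that has received the most WAL:
    psql -h replica1 -tAc 'SELECT pg_last_xlog_receive_location();'
    psql -h replica2 -tAc 'SELECT pg_last_xlog_receive_location();'

    # 2. Get Pacemaker out of the way on every node
    #    (service names vary by distro):
    systemctl stop pacemaker corosync

    # 3. Hand the VIP to the winner and promote it:
    ip addr add 10.0.0.50/24 dev eth0
    pg_ctl promote -D /var/lib/postgresql/data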

~~~
barkingcat
If this is indeed true, doesn't this negate the purpose of Pacemaker to
begin with? It's like anti-software: when you run it in your environment, to
recover from a failure (which seems to me what HA software should be about)
you have to turn it off first, or else it will destroy your recovery
attempts.

It's like a perverse version of chaos-monkey, except you want it to destroy
you when you are most vulnerable.

~~~
jskrablin
It's great when it works as expected. When it doesn't... then the fun
begins. I've found it quite fragile: sensitive to component versions,
sensitive to configuration, etc. Most of the time I've seen Pacemaker go
crazy, Pg itself was happy to cooperate once Pacemaker was out of the way.
The unknown/weird Pacemaker failure modes were a real (and scary) problem.

I guess the lesson here is not to rely entirely on some HA black magic, and
to always have procedures in place for the 'HA black magic failed us'
moments. And a team trained to deal with situations like this. It's only
software, so it will break sooner or later.

------
testplzignore
Good write-up. I'm curious about two more things:

1. What caused the crash on the synchronous replica? Was it just a
coincidence and completely unrelated to the primary failure?

2. Given the three conditions necessary for the cluster to break, was the
behavior of the Pacemaker software expected? I.e., was this a gotcha that
should be in the Pacemaker documentation, or a bug?

~~~
Sinjo
1. Unfortunately the logs don't give any detail there. Most likely something
arrived down the replication connection that the process couldn't handle, and
it crashed.

2. Our understanding now is that INF is the strongest preference, whereas
-INF is a veto. It would be very cool to have this confirmed 100% by someone
who works on Pacemaker!
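
To make that concrete with pcs (resource and node names are made up): a
+INFINITY location score means "prefer this node above anything finite",
while -INFINITY (what `avoids` sets) means "never run here", i.e. a veto
that no positive score can outvote:

    # Strong preference; can still lose to a veto elsewhere:
    pcs constraint location pg-vip prefers node-a=INFINITY

    # Veto: the VIP will never be placed on node-b:
    pcs constraint location pg-vip avoids node-b=INFINITY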

------
cavisne
The end of this post-mortem was a bit handwavy TBH. I feel like they didn't
dig deep enough: the problem was the backup VIP on its own, not the backup
VIP plus the two processes crashing at once.

I think by still allowing the backup VIP to run on the sync replica, the
same mistake is being repeated: there will always be the possibility of a
situation where the VIP cannot be moved when promotion is required. That
replica should be doing nothing but sitting there waiting to save the day,
and if they want the backup VIP to be highly available they should provision
2 async replicas.

------
kodablah
I too am coming up on a need for no-downtime HA failover for Postgres. I too
am not allowed to use a hosted PaaS-ish solution like RDS. I was considering
Citus's multi-master implementation (I don't need to spread the load, just
need HA). I had not considered Pacemaker. Has GoCardless investigated this
option, and do you have any insight to share? HA has traditionally been a
real pain point for traditional RDBMSs in my experience.

~~~
Sinjo
To be honest we've not looked into Citus in any depth.

My early impression of it (can't speak for the rest of the team) was that it
was mostly aimed at sharding analytics workloads, but parts of the docs (e.g.
[https://docs.citusdata.com/en/v7.1/admin_guide/cluster_management.html#worker-node-failures](https://docs.citusdata.com/en/v7.1/admin_guide/cluster_management.html#worker-node-failures))
make it sound like it handles OLTP workloads too.

Maybe I've been ignoring it for bad reasons!

EDIT: Managing Postgres clusters is something that a lot of people are working
on. Thought I'd mention two projects that have me excited right now:

    
    
      - Patroni https://github.com/zalando/patroni
      - Stolon https://github.com/sorintlab/stolon
    

Stolon's client proxy approach in particular looks interesting, and reminds me
of how people are using Envoy
([https://github.com/envoyproxy/envoy](https://github.com/envoyproxy/envoy)),
albeit as a TCP proxy rather than one that understands and can do fun stuff
with the database's protocol. I wonder if we'll start to see more Envoy
filters for different databases!

~~~
craigkerstiens
Craig from Citus here. Since we added transactional support a couple of
years ago, along with a number of features since then, much of our traction
has come from those outgrowing single-node Postgres and needing more
performance. So, in short, we're very much focused on handling and supporting
OLTP workloads.

We also support some analytics workloads (less so data warehousing), where
there is a need for end-user-facing analytics and higher concurrency and
real-time responsiveness are key.

------
echelon
My company's data team tells me that MySQL replication blows Postgres out of
the water, but they could just be biased, since that is their area of
expertise. I work on server code and don't really have much familiarity with
the operations of running replica chains.

Postgres seems like a better choice for personal projects since it has a lot
of nifty features. I'm also wary of Oracle, but that's my own attitude
talking. For a startup eventually wanting to scale, would the better choice
be to use MySQL out of the gate? Am I being misled about Postgres clusters
and availability?

Serious (naive) question; not wanting to start a flame war.

~~~
dijit
Having run MySQL in prod for a decade and PostgreSQL in prod for half a
decade, I can say without doubt that your data team is telling fibs.

First, there are multiple replication options for both technologies;
however, I'm going to assume the defaults, because that's pretty much what
everyone uses unless there's an actual case for using something else. It's
the exception.

By default, MySQL uses statement-based replication (in a weird binary format
with log positions and stuff) and PostgreSQL does physical streaming
replication (as in, you transmit the binary differences straight to the
replica's database files and the replica just follows along).

Both of these approaches have trade-offs depending on what you want.

Statement-based replication is great if you want to have _different_
datasets on each side: you can transform the data or remove huge chunks of
it on a replica and use it for a dedicated purpose. However, that cuts the
other way too: you can never really be 100% sure that your replica looks
exactly like your master.

This bit me a few times with MySQL, when I assumed that because the replica
was 'up to date' with the master and was set to read-only, the data had
integrity. It absolutely did not.
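
That failure mode (a "caught up", read-only replica whose data has silently
diverged) is exactly what Percona's pt-table-checksum exists for. A rough
sketch (host and user are made up):

    # Checksum chunks of every table on the master; the checksums
    # replicate, so each replica can expose its own drift:
    pt-table-checksum --replicate=percona.checksums h=master.internal,u=checker

    # Print the statements that would bring a drifted replica back:
    pt-table-sync --print --replicate=percona.checksums h=master.internal,u=checker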

~~~
antoncohen
I don't think the claim of MySQL replication being better is about statement
vs. row vs. binary diff. I think it is about the tooling and community
knowledge around replication, and about running MySQL in large-scale
production environments in general.

MySQL is run at extremely large scale (Facebook, YouTube, Twitter, Dropbox,
etc.) more often than Postgres. That results in very battle-tested and/or
featureful tooling like orchestrator
([https://github.com/github/orchestrator](https://github.com/github/orchestrator)),
MHA ([https://github.com/yoshinorim/mha4mysql-manager](https://github.com/yoshinorim/mha4mysql-manager)),
ProxySQL ([http://www.proxysql.com/](http://www.proxysql.com/)), and gh-ost
([https://github.com/github/gh-ost](https://github.com/github/gh-ost)), along
with knowledge and best practices shared by those organizations.

~~~
amigoingtodie
Does anybody have real-world use cases for ProxySQL?

I have been toying with the idea of using it to let local devs access our
prod DB for reads (for accurate data) while using a local DB on their
machines for writes.

Not sure how to handle UPDATEs, though.

~~~
emfree
Use cases for ProxySQL: many.

- failover

- query routing (e.g., for sharded deployments; see the sketch below)

- caching

- workload stats/metrics

- query rewriting

etc.
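
For a flavour of the query-routing case (hostgroup numbers and rules are
made up; the admin interface really does default to port 6032): ProxySQL is
configured by writing rows into its admin tables. Here's a sketch that keeps
locking reads on the writer and sends everything else SELECT-shaped to the
readers:

    mysql -h 127.0.0.1 -P 6032 -u admin -padmin <<'SQL'
    INSERT INTO mysql_query_rules
        (rule_id, active, match_pattern, destination_hostgroup, apply)
    VALUES
        (1, 1, '^SELECT.*FOR UPDATE$', 0, 1),  -- writer hostgroup
        (2, 1, '^SELECT',              1, 1);  -- reader hostgroup
    LOAD MYSQL QUERY RULES TO RUNTIME;
    SAVE MYSQL QUERY RULES TO DISK;
    SQL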

------
qaq
There are specialized tools like Patroni
([https://github.com/zalando/patroni](https://github.com/zalando/patroni))

~~~
emXdem
I like Stolon:
[https://github.com/sorintlab/stolon](https://github.com/sorintlab/stolon)

------
Erwin
If you are running HA in AWS RDS, how would you compare your experience with
the above? What types of RDS failure modes have you experienced?

So far I've discovered that TCP keepalives are quite important; otherwise
your queries may hang forever after failover (or at least for the default
timeout, which is something like 30 minutes). The connection does not
otherwise get broken by the failover.
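
For reference, the keepalive knobs can go straight into the libpq connection
string, so you don't have to touch kernel-wide defaults (the host is made
up; the parameters are standard libpq):

    # Declare the peer dead after ~10s idle plus 3 probes 10s apart,
    # instead of waiting out the OS-level TCP timeout:
    psql "host=mydb.example.rds.amazonaws.com dbname=app \
          keepalives=1 keepalives_idle=10 keepalives_interval=10 keepalives_count=3"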

~~~
pjungwir
Here are a couple of gotchas I've seen on RDS:

- If you are running a MultiAZ instance, it is supposed to fail over
automatically, but if the problem is in the networking, then you can still
effectively lose service. One way around that is to run a read replica in
another AZ, and use a Route53 entry with a health check to send traffic to
the read replica if the primary isn't reachable. You'll still need to
promote the read replica to a master, though.

- If you restore from a snapshot, the new EBS volume only pulls blocks of
data from S3 as they are requested, so those reads are a lot slower than
normal. If you have a large database you could have degraded performance for
days. Here is some more info:
[http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-restoring-volume.html](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-restoring-volume.html)
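
Two hedged follow-ups (identifiers and table name are made up): the
promotion is a single API call, and since RDS gives you no shell to
force-read blocks, the closest thing to pre-warming after a snapshot restore
is reading the data back through Postgres itself, e.g. with the pg_prewarm
extension (which RDS supports on recent Postgres versions, IIRC):

    # Promote the cross-AZ read replica once the primary is gone:
    aws rds promote-read-replica --db-instance-identifier mydb-replica

    # After a snapshot restore, fault the hot tables' blocks in from
    # S3 before production traffic does:
    psql -c 'CREATE EXTENSION IF NOT EXISTS pg_prewarm;'
    psql -c "SELECT pg_prewarm('payments');"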

------
ncmncm
I am astonished that, in those two years, you had not already handled 100+
scheduled failovers. If your HA is good, customers don't notice; if not, you
find out when there are fewer of them (and in daytime!), and fix it.

Probably by now Pacemaker would have been abandoned. A hundred drills would
have been enough to flush out these behaviors. If you are afraid to run drills
on production equipment, you should be running them on a full-scale production
testbed, ideally with mirrored production traffic. With a production-scale
testbed, two years is enough to run thousands of risk-free failovers.

Not doing frequent production failure drills is just irresponsible.

------
derekmhewitt
I'm seeing the term "HA" used a lot in the comments here. What does it mean?

~~~
inimino
high availability

------
emXdem
Stolon with etcd or consul is a far superior solution for HA postgres.

------
notyourday
Stop pretending that there's a magic bullet called "multi-master" and
"transparent promotion". Your apps are super simple. Their DB interactions are
super simple. Learn how to do federations and all these problems will go away.

~~~
rosser
It's curious that you decry some kinds of "magic bullet" in favor of another.

How about "there are no magic bullets, _full stop_ "?

