
High Availability for PostgreSQL, Batteries Not Included - rfks
https://www.compose.io/articles/high-availability-for-postgresql-batteries-not-included/
======
ninkendo
Where I am we have a similar setup for leader election and failover (using
etcd and haproxy) but we add an additional step: a standby instance that does
not participate in master election, and always follows the elected master.

Then we turn on confirmed writes on the master so that the non-participating
standby (called the "seed") has to receive and confirm your write before the
transaction can commit.
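In postgresql.conf terms that's just synchronous replication to a single
named standby; a minimal sketch (the application_name "seed" is our own
convention, not something Postgres defines):

    # on the elected master: block commits until the standby that
    # identifies itself as "seed" confirms receipt of the WAL
    synchronous_standby_names = 'seed'
    synchronous_commit = on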

This has the bonus of preventing split brain... If the wrong instance thinks
it's master, writes will block indefinitely because the seed isn't confirming
them. If the seed is following the wrong machine, same thing. And if clients
and the seed and the master are all "wrong", then that's ok because at least
they all "consistently" disagree with etcd.

The seed instance can run anywhere, and is responsible for receiving WAL
snapshots from the master and archiving them (to shared storage) so it can
crash too and be brought up elsewhere and catch up fine. The writes just block
until this converges.
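The archiving half might look roughly like this (the shared path is
hypothetical, and details vary by Postgres version; newer releases need
`archive_mode = always` for a standby to archive WAL it received):

    # on the seed: copy completed WAL segments to shared storage so a
    # rebuilt seed can catch up from there
    archive_mode = on
    archive_command = 'test ! -f /mnt/wal-archive/%f && cp %p /mnt/wal-archive/%f'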

It's worked quite well for us for a few months on a hundred or so Postgres
clusters, we haven't seen an issue yet. I'd love for somebody knowledgeable
about this stuff to point out any flaws.

~~~
dap
That's interesting. We do something pretty similar in the Manatee component
that I mentioned elsewhere in this thread, except that the designated
synchronous standby can take over if the primary goes away. But it can only do
so when another peer is around to become the new synchronous standby, so we
maintain the write-blocking behavior that avoids split-brain.

------
teraflop
Like a lot of designs that use Raft/Zookeeper/Paxos/whatever as a building
block, the full system doesn't inherit all of the safety properties of the
underlying consensus algorithm. I don't think that makes this code useless by
any means, but I think it's important to be aware of the edge cases.

Consensus algorithms are popular because they're supposed to solve the
difficult problem of _guaranteeing_ consistency while attempting to provide
liveness, in the presence of _arbitrary_ node or connection failures. Etcd
itself can provide this for operations on its own datastore, but that doesn't
mean it can be used as a perfect failure detector for another system (which is
impossible in the general case). In particular, if the database master becomes
partitioned from the etcd leader for more than 30 seconds but is still
accessible to clients, boom -- split brain.

(You can attempt to mitigate this with timeouts, but that's not foolproof if
your system can experience clock skew or swapping/GC delays. Exactly this kind
of faulty assumption has caused critical bugs in e.g. HBase in the past,
turning what would otherwise be a temporary period of unavailability into data
loss.)

EDIT: If I'm reading the code correctly, compose.io doesn't make any attempt
to mitigate this failure scenario. If the PostgreSQL master can't contact
etcd, it continues acting as a master indefinitely, even after 30 seconds have
expired and another server might have taken over. This appears to be what
happens in the "no action. not healthy enough to do anything." case in ha.py.
I'd be happy to be corrected if there's something I'm missing.
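For illustration, the missing guard could be as small as this (a sketch; both
callables are hypothetical, and per the caveat above, clock skew or a long
GC/swap pause can still defeat a local timer like this):

    import time

    LEASE_TTL = 30  # seconds, matching the etcd leader key's TTL

    def master_loop(renew_leader_key, demote_to_read_only):
        deadline = time.monotonic() + LEASE_TTL
        while True:
            try:
                renew_leader_key()  # refresh the leader key in etcd
                deadline = time.monotonic() + LEASE_TTL
            except Exception:
                # etcd unreachable; once the lease could have expired,
                # stop accepting writes rather than risk split brain
                if time.monotonic() >= deadline:
                    demote_to_read_only()
                    return
            time.sleep(LEASE_TTL / 3.0)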

~~~
jsprogrammer
If the PostgreSQL leader doesn't reset the leader key, it's no longer leader.

~~~
rosser
Fencing isn't quite that simple, unfortunately.

I've been doing database, and specifically PostgreSQL, administration and HA
setups for a long time now. This stuff is a lot harder than people think it
is. People who roll their own solutions, thinking "Oh, this will totes be good
enough!" tend to find themselves very painfully surprised that it isn't.

~~~
technion
I've seen multiple deployments (not necessarily specific to PostgreSQL)
engineer themselves into a corner with what people feel will be a highly
available roll-your-own solution, complete with convincing-sounding blog
posts.

In every case, at some point, there were implementation or software-bug
issues that ultimately caused more unplanned outages than I've ever seen a
single, well-run server experience.

~~~
rdtsc
Based on your experience, is there a common bug or scenario that you see
overlooked often? Like, say, what happens during the transition between
leaders, or handling multiple failures (multiple netsplits...)?

~~~
technion
I can't really identify a common problem. Things I've seen include:

* After a complete, planned shutdown, neither server is happy to start until
it sees the other one online. In the end, neither ends up booting.

* A failover occurs, at which point you find out the hard way that there is
state being stored in a non-replicated file. I've seen this with several
different Asterisk HA solutions in particular.

* A failover occurs, and non-database-aware storage snapshots leave the
redundant server with a non-mountable mirror of the database.

------
pilif
Personally, I would try to go for a simpler solution. A failover event is
already complicated in itself and happens at a point in time when stuff is
already going wrong (there would be no failover otherwise). Do you really
want all this additional infrastructure, with etcd and haproxy as
dependencies?

If you can live with a few minutes of downtime, I would recommend triggering
your failover via human intervention, once you have ascertained that the
failover would actually help (you never, ever want to fail over if the master
doesn't respond in time due to high load; at that point, failing over will
only make things worse due to cold caches).

See [https://github.com/blog/1261-github-availability-this-
week](https://github.com/blog/1261-github-availability-this-week) for a nice
story of automated DB failover going wrong.

In our case, we're running keepalived to share the IP address of the postgres
master, but we don't actually automatically act on PG availability changes.
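The keepalived side is tiny; a sketch (interface, router id and VIP are made
up, and there's deliberately no Postgres health-check script in it):

    vrrp_instance PG_VIP {
        state BACKUP
        interface eth0
        virtual_router_id 51
        priority 100
        virtual_ipaddress {
            10.0.0.50/24
        }
    }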

In a situation that actually warrants the failover, a human will kill the
master node by shutting it down and keepalived will select another master and
trigger the failover (which is then automated using `trigger_file` in
`recovery.conf`).
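On the standby, that's the standard recovery.conf mechanism; a minimal
sketch (connection details and the trigger path are hypothetical):

    standby_mode = 'on'
    primary_conninfo = 'host=10.0.0.50 port=5432 user=replicator'
    trigger_file = '/var/lib/postgresql/failover.trigger'

Creating the trigger file (e.g. from keepalived's notify_master hook)
promotes the standby to a read-write master.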

In this case we have only one additional piece of infrastructure (keepalived)
and we can be sure that we don't accidentally make our lives miserable with
automated failovers.

The cost is, of course, potential additional downtime while somebody checks
the situation, does minimal emergency root cause analysis and then shuts down
the failed master.

In the even rarer case of hardware failure, keepalived would of course fail
over automatically, but let's be honest: most failures are caused by
application or devops issues, and in those cases it pays off to be diligent
instead of panicking.

~~~
jvehent
Can keepalived automatically float MAC addresses nowadays? Last time I
checked, that didn't work and clients needed an arp flush to use the new
master.

~~~
MichaelGG
Shouldn't sending a gratuitous ARP work to update clients?

------
pjungwir
Since most of the comments are critical, I'll say: thank you for the awesome
writeup! I agree this is more complex than HA PG setups I've done in the past,
but I'm thrilled to have another perspective. Also doing a thorough writeup
like this takes time, and a lot of people would rather jump back into building
the next thing. It's a great contribution!

I agree with pilif that you almost always want to fail over the db manually.

I agree with teraflop that just because etcd gives strong guarantees, that
doesn't mean your application logic built on top of etcd primitives shares
them. So you have to be careful about your reasoning there.

I'm curious if you're doing anything to mitigate haproxy being a single point
of failure?

One thing I've had to fix in other people's HA PG setups is ease of getting
back to HA after a failover. You lose the master and promote the slave, and
now you've just got a master. Ideally it should be easy to just launch another
db instance and everyone keeps going. I think this setup achieves that, and
that's great!

~~~
gshx
Agreed, and that's a great point about human-driven failover. It can become
a challenge for distributed databases running on a large number of instances
(like Bigtable), but if we're talking only about master HA, then yes, that
can still be done with human intervention, though automation is preferable.
For smaller db setups, it's much easier to just let a human/DBA intervene.

~~~
mrkurt
Smaller DB setups rarely have the ops/DBA support required to do manual
failover. I think having an as-consistent-as-feasible, automatic failover is
something of a default expectation for databases these days, at any size.

~~~
Xorlev
You need a larger team to do automatic failover because getting it right is a
massive PITA. Either that or pay someone to do it right for you, e.g. RDS,
managed solutions.

Manual failover is often a lot safer; automatic systems have a nasty habit
of not doing what you expect them to and trashing your database or losing
data.

------
dap
Thanks for writing this up!

At Joyent, we built a similar system for automated postgresql failover called
Manatee. I'm sure today we would have used a Raft-based system, but that was
not available when we did this work, so we used ZooKeeper. We haven't spent
much time polishing Manatee for general consumption, but there's a write-up on
how it maintains consistency[1]. The actual component is available here[2],
and it's also been ported to Go as part of Flynn[3].

Edit: Manatee uses synchronous replication, not async, so it does not lose
data on failover.

[1] [https://github.com/joyent/manatee-state-
machine](https://github.com/joyent/manatee-state-machine)

[2] [https://github.com/joyent/manatee](https://github.com/joyent/manatee)

[3] [https://github.com/flynn/flynn](https://github.com/flynn/flynn)

------
imperialWicket
This seems robust, but feels like more moving parts than are necessary.

I feel like HAProxy with PostgreSQL + Bucardo (multi-master + at least one
slave) would achieve this, and net you fewer moving parts. Under what
circumstances does this fail where the etcd-dependent solution succeeds?

~~~
chucky_z
Can I ask why HAProxy seems to be a more popular choice than the very, very
simple (and robust) pgbouncer?

~~~
winsletts
We tested with PGPool and PgBouncer in various iterations.

PGPool failed at basic failover. It worked fine while the leader remained
leader. It would fail over to the follower, who became the new leader, but
after the first failover it would stall on connections. We worked through
various settings and attempts at making it more stable, but in the end we
were not happy with the stability.

PGBouncer requires a connection to a single database and requires a user
store maintained at the PGBouncer level. One of our internal requirements for
our Postgres service is to give customers full access to Postgres
capabilities. PGBouncer would either limit customer functionality or require
us to build more tools for customers to use Postgres's complete
functionality. For instance, if a customer ran `CREATE USER foo WITH LOGIN …`
over the Postgres connection, the customer would not be able to authenticate
as the foo user, because PGBouncer would not have immediate knowledge of the
new user.

In the end, HAProxy offered the stability and enabled the base functionality
of Postgres we wanted. In tests, it failed over quickly and reliably. The
only caveat with HAProxy + Postgres is that you have to rely on TCP
passthrough with SSL termination at Postgres. We'd have preferred SSL
termination at HAProxy, but Postgres engineered its own connection procedure
to listen for standard and SSL connections on the same port. SSL termination
at HAProxy was causing issues for drivers that were built to use that
procedure and cannot use a standard SSL connection.
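That TCP passthrough looks something like this in haproxy.cfg (a minimal
sketch; the address and check details are illustrative, not our actual
config):

    listen postgres
        bind *:5432
        mode tcp      # pass raw bytes through; SSL terminates at Postgres
        option tcp-check
        server pg-leader 10.0.0.10:5432 check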

~~~
chucky_z
Interesting. I got around this by just recreating the users file for pgbouncer
and issuing a reload every 15 minutes or so. This was only needed for one
setup as the rest were just static, or the user was very low volume so I asked
them to connect directly. pgbouncer handled reloads _very_ well (on-par with
nginx) so it worked for me. Thanks for the great insight with your answer.
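Concretely, the refresh can be as small as this (a sketch; paths and the
admin user assume a stock pgbouncer setup):

    # rebuild pgbouncer's userlist from pg_shadow, then reload
    psql -Atq -U postgres -d postgres -c \
      "SELECT '\"' || usename || '\" \"' || coalesce(passwd, '') || '\"' FROM pg_shadow" \
      > /etc/pgbouncer/userlist.txt
    psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "RELOAD;"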

------
Someone
_" If no one has the leader key it runs health checks and takes over as
leader."_

I'm no expert at all on this stuff, but I do smell either a race condition
(if another node comes alive and 'goes to see who owns the leader key in
etcd' before the first node 'takes over as leader') or a longer-than-needed
time without a leader (where the new node knows it wants to become the
leader, but is still running health checks).

~~~
winsletts
The code relies on functionality in etcd to prevent a race condition: the
leader key is acquired with `prevExist=false`, so the set will fail if
another node has already won the race.

The functionality in the code is here:
[https://github.com/compose/governor/blob/master/helpers/etcd...](https://github.com/compose/governor/blob/master/helpers/etcd.py#L76)

The documentation for etcd is here:
[https://coreos.com/etcd/docs/latest/api.html#atomic-
compare-...](https://coreos.com/etcd/docs/latest/api.html#atomic-compare-and-
swap)
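
Concretely, with the etcd v2 HTTP API this is a conditional PUT (the key
name and TTL here are illustrative); etcd answers the loser of the race with
a "Key already exists" error:

    curl "http://127.0.0.1:2379/v2/keys/service/pg/leader?prevExist=false" \
         -XPUT -d value="postgres0" -d ttl=30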

~~~
Someone
But then, isn't it not

 _" If no one has the leader key it runs health checks and takes over as
leader."_

but

 _" If no one has the leader key it takes over as leader, runs health checks,
and starts functioning as leader."_

? If so, I would do the health checks and then try to become the leader. Or do
the 'health checks' involve other nodes?

~~~
merb
It simply relies on the Voting feature of ETCD (Raft) it's really simple to
use locking with etcd, and etcd is really really stable. However it would be
easier to install etcd on every Postgres node and just make a golang library
that sets the master of Postgres to the etcd master (etcd also has a leader).
Also systemd would keep the overall system healthy. (that's what we at envisia
do) Just have repeatedly check if the machine is the leader and if yes it sets
the url of the currently running machine to a etcd key. So overall we need to
use 3 Postgres machines and 1 could fail and we would still have voting,
however thats just for a single master where we don't need to read from the
slaves, however thats easily extendable.
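
A rough Python sketch of that loop (the endpoints, key, and address are made
up; in practice you'd put this in the Go library):

    import time, requests

    while True:
        # ask the local etcd member whether it is currently the Raft leader
        me = requests.get("http://127.0.0.1:2379/v2/stats/self").json()
        if me.get("state") == "StateLeader":
            # publish this machine as the Postgres master, with a TTL
            requests.put("http://127.0.0.1:2379/v2/keys/pg/master",
                         data={"value": "10.0.0.10:5432", "ttl": 30})
        time.sleep(10)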

Oh and here is the Compare and Swap (Atomic) functionality of etcd that he
described:
[https://github.com/coreos/etcd/blob/master/Documentation/api...](https://github.com/coreos/etcd/blob/master/Documentation/api.md#atomic-
compare-and-swap)

~~~
winsletts
The problem with etcd members on every Postgres node is that etcd clusters
expect a fixed set of nodes or members. etcd doesn't function well in an
environment where you tear down and build up new nodes regularly. Most of our
Postgres service runs on AWS, and thus we must expect that any single node
may vanish, and our system must replace that node. We tried running etcd
alongside Postgres in an early prototype, but ran into issues with etcd
cluster stability when destroying and recreating nodes. Thus, we opt for a
standalone etcd cluster distinct from the Postgres cluster.

~~~
robszumski
You can set up a local etcd proxy to mitigate this. You'd run the proxy
listening on localhost, and then have it connected to the stable etcd cluster
elsewhere.

The proxy can find the cluster manually or use SRV records. Autoscale the
Postgres machines as much as you want after that while leaving etcd on stable
machines.
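
With etcd 2.x that looks roughly like this (a sketch; the cluster addresses
are placeholders, and `--discovery-srv` is the SRV-record variant):

    # on each Postgres node: a local proxy that forwards to the real cluster;
    # local clients just talk to 127.0.0.1:2379
    etcd --proxy on \
         --listen-client-urls http://127.0.0.1:2379 \
         --initial-cluster "etcd0=http://10.0.1.10:2380,etcd1=http://10.0.1.11:2380,etcd2=http://10.0.1.12:2380"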

~~~
merb
That's basically what we're trying to do in the future; however, it's really
hard if you want a running etcd cluster with 5 nodes at all times. You would
need to check whether one etcd member died, and then either promote a proxy
to a full etcd member or spin up a new machine (the latter is only possible
in clouds or virtual environments).

~~~
SEJeff
You can do that trivially with Mesos and have it always ensure 5 instances
are running. Bonus points: it will run identically on bare metal and across
clouds, which means less vendor lock-in for you.

------
matrixritter
I really wonder why people go for implementing their own HA stack when
there's Pacemaker or rgmanager available.

First: There's a functional resource agent available to handle PostgreSQL. It
handles single instances or multiple ones.

Second: The whole cluster thing can be very complex. You can have a LOT of
failure scenarios, and I wouldn't recommend that anyone try to catch them
all.

------
wyc
Slightly off-topic:

If you're considering MySQL for HA, a project called Vitess works well with
Kubernetes + CoreOS, and has been in production use at YouTube for a while
now:

[http://vitess.io/](http://vitess.io/)

------
dgreensp
What does "Batteries Not Included" mean for a highly available database? Is it
a good thing?

~~~
stonemetal
"Batteries Not Included" is a phrase found on the box of children's toys, so
that parents know they will need to buy batteries before giving it to their
kids. It has become an expression for something being incomplete and will
require effort before it will work properly.

In this case the author is stating that Postgres doesn't come with a high
availability capability. He then goes on to explain the high availability
setup he put together.

------
knite
How does this compare to RDS - both in terms of HA, and generally speaking?

------
qaqy
And with this amazing design you can easily lose committed data and have all
sorts of other fun problems.

~~~
chousuke
Is it even possible to guarantee that you won't lose commits with postgresql
replication? For many applications, consistency is more important than not
losing any data ever. For the other kind of application, you'll need something
else.

~~~
dap
With postgresql synchronous replication, in order to lose writes that have
been acknowledged to the postgresql client, you'd have to lose filesystem data
on both the primary and the synchronous standby. (I believe the way postgresql
uses the term "committed", you can lose data that's "committed", but not once
postgresql has acknowledged it to the client.)

For many applications, consistency _includes_ not losing acknowledged data. If
I PUT data into an application and fetch it back and it's not there, that's
not consistent.

~~~
snuxoll
> For many applications, consistency includes not losing acknowledged data. If
> I PUT data into an application and fetch it back and it's not there, that's
> not consistent.

Durability and consistency are two separate concepts.

