
How We Implement Disaster Recovery and High Availability with Postgres at Citus - simonw
https://www.citusdata.com/blog/2017/03/23/a-look-into-disaster-recovery-and-high-availability-and-how-they-work-with-postgres-on-citus-cloud/
======
wjossey
Continue to enjoy the write-ups here by the Citus team. Your articles are good
primers as well for young engineers who may not know what basic database
management looks like in a production environment.

I have a follow up question from the article. You mention that your HA
functionality includes health checks which then trigger auto failover to the
replica in the event of six consecutive failures on the master. What's the
client experience during that time frame? Would the database cluster be
totally unresponsive, or would it be totally transparent so long as it's only
a certain number of nodes experiencing issues?

~~~
craigkerstiens
So it depends a bit on the failure. If it's the coordinator node you're
connected to, then everything would be unresponsive. If it's a single
distributed node, then only queries targeting that node would fail. Most
application frameworks and Postgres drivers do try to reconnect, so there
wouldn't be a change to application logic, but things during that time would
fail as you might expect when a database is down.
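The reconnect-and-retry behavior described above can be sketched generically. This is not from the Citus post; it's a minimal illustration of driver-level retry during a failover window, with a fake connection function standing in for a real driver such as psycopg2:

```python
import time

def with_reconnect(connect, operation, retries=5, delay=1.0):
    """Retry an operation, reconnecting on each attempt.

    `connect` returns a fresh connection; `operation` takes that
    connection and may raise while the database is unreachable.
    """
    last_err = None
    for attempt in range(retries):
        try:
            conn = connect()
            return operation(conn)
        except Exception as err:  # e.g. psycopg2.OperationalError in practice
            last_err = err
            time.sleep(delay * (2 ** attempt))  # exponential backoff
    raise last_err

# Simulate a database that is down for the first two attempts,
# as it might be mid-failover, then comes back once the replica is promoted.
state = {"attempts": 0}

def fake_connect():
    state["attempts"] += 1
    if state["attempts"] <= 2:
        raise ConnectionError("database unavailable (failover in progress)")
    return "connection"

result = with_reconnect(fake_connect, lambda conn: "ok", delay=0.01)
print(result)
```

During the failover window requests fail or block as above; once the promoted replica accepts connections, the retry succeeds with no change to application logic.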

------
simonw
Archiving PostgreSQL WAL (Write Ahead Log) files to S3 on a continuous basis
to provide reliable, fast disaster recovery is a really neat trick.
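To make that concrete, a typical setup pairs Postgres's `archive_command` with wal-e's `wal-push`. The fragment below is an illustrative sketch, not the Citus configuration; the envdir path and data directory are assumptions:

```shell
# postgresql.conf fragment (illustrative; paths are assumptions)
wal_level = replica       # emit enough WAL detail for archiving/replication
archive_mode = on
# ship each completed WAL segment to S3 via wal-e
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

# plus a periodic full base backup (e.g. from cron), which the
# archived WAL stream is replayed on top of:
#   envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.6/main
```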

~~~
jaytaylor
Obligatory link to make this more concrete:
[https://github.com/wal-e/wal-e](https://github.com/wal-e/wal-e)

------
skyisblue
We also use wal-e in production and it's been great. Storing your WAL files
also allows you to do point-in-time recovery, so you can restore your db to a
particular time before a stuff-up happened.

