Hacker News new | past | comments | ask | show | jobs | submit login

Kubernetes can’t change any database’s HA or durability features; there’s no magic k8s can apply to make a database that does e.g. asynchronous replication have the properties of one that does synchronous replication. So you’ll never gain any properties your underlying database is incapable of providing.

However, if I had to run Postgres as part of something I deployed on k8s AND for some reason couldn’t use my cloud provider’s built in solution (AWS RDS, Cloud SQL, etc.) I would probably go with using/writing a k8s operator. The big advantage of this route is that it gives you good framework for coordinating the operational changes you need to be able to handle to actually have failover and self-healing from a Postgres cluster, in a self-contained and testable part of your infrastructure.

When setting up a few Postgres nodes with your chosen HA configuration you’ll quickly run into a few problems you have to solve:

* I lose connectivity to an instance. Is it ever coming back? How do I signal that it’s dead and buried to the system so it knows to spin up a fresh replica in the cases where this cannot be automatically detected?

* How do I safely follow the process I need to when upgrading a component (Postgres, HAProxy, PGBouncer, etc.)? How do I test this procedure, in particular the not-so-happy paths (e.g. where a node decides to die while upgrading).

* How do I make sure whatever daemon that watches to figure out if I need to make some state change to the cluster (due to a failure or requested state change) can both be deployed in a HA manner AND doesn’t have to contend with multiple instances of itself issuing conflicting commands?

* How do I verify that my application can actually handle failover in the way that I expect? If I test this manually, how confident am I that it will continue to handle it gracefully when I next need it?

A k8s operator is a nice way to crystallize these kinds of state management issues on top of a consistent and easily observable state store (namely the k8s API’s etcd instance). They also provide a great way to run continuous integration tests that you can actually throw the situations you’re trying to prepare for at the implementation of the failover logic (and your application code) to actually give you some confidence that your HA setup deserves the name.

But again, I wouldn’t bite this off if you can use a managed service for the database. Pay someone else to handle that part, and focus on making your app actually not shit the bed if a failover of Postgres happens. The vast majority of applications I’ve worked on that were pointed at a HA instance would have (and in some cases did) broken down during a failover due to things like expecting durability but using asynchronous replication. You don’t get points for “one of the two things that needed to work to have let us avoid that incident worked”.

> AND for some reason couldn’t use my cloud provider’s built in solution (AWS RDS, Cloud SQL, etc.)


> Pay someone else to handle that part

Aside from the money (managed Postgres is expensive), I'd actually like to understand what good, high-availability Postgres solutions look like.

Totally valid if you’re reaching the larger instance sizes. However I’d caution you to not underestimate what running a good HA Postgres setup costs in engineering/operations time (particularly if you’re not familiar with running one). Be ready to get a clearheaded TCO that you won’t be happy with.

In general, you need some sort of separate strictly serializable store run in an HA configuration to manage the state changes, and whatever is managing the state changes needs to be run in multiple instances across your fault domains as well. Others have mentioned Patroni; I’ve used it before (though with etcd, not k8s) and been quite happy with it. Be aware that (as it cautions you in the README) it’s not a total point and shoot tool, you do need to read through it’s caveats and understand the underlying Postgres replication features.

The documentation is pretty good, if you want to get an idea of what the logic looks like they have a nice state diagram: https://github.com/zalando/patroni/blob/master/docs/ha_loop_...

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact