Because clients talk to Citus using the synchronous postgres protocol, latency is one of the most important factors in performance. A single connection can never do more than 1/latency transactions per second. While you could use more connections, that is often complex, not always possible (e.g. long transaction blocks), and also comes with a significant performance penalty (fewer prepared statements, high SSL overhead, ...). It's important to optimise for latency.
We find the Citus transaction model is sufficient for most purposes. If you think of the typical balance transfer example, then there is no way to double spend or overspend. Distributed transactions are atomic and locks in postgres will serialise distributed transactions that update the same rows and ensure constraints are preserved. Importantly, Citus has distributed deadlock detection, which allows generous concurrency at almost no cost to the application (apart from maybe having to reorder transaction blocks).
Postgres supports different modes of replication, including synchronous replication to avoid losing any data on node failure. Many users actually prefer asynchronous replication, to avoid the extra round-trip and the broader performance implications that would have. It's also the reason we use streaming replication over other replication mechanisms that might give higher availability. Streaming replication supports higher concurrency and thus gives better throughput.
At the end of the day, people use Citus because a single node postgres (or MySQL) database is not performant enough to keep up with their workload. Performance is key. Citus is not a silver bullet for every postgres performance problem, but it has two particular strengths:
Minimal pain scaling out for multi-tenant applications that can shard by tenant.
- Any SQL query is supported when filtering by partition column, including CTEs, window functions, complex joins, etc,
- N times more CPU, memory and I/O bandwidth to process queries, and 1/N as much data to read
- High throughput, low-latency transactions, including UPDATE/DELETE with subqueries
- Citus Warp can do live migration from postgres (e.g. RDS) using logical decoding
- Major ORMs are supported after small data model changes
- Bulk loading, up to millions of rows/sec
- Massively parallel SQL with subqueries containing outer joins, lateral joins, etc. across all shards when joining by partition column
- Massively parallel roll-ups (again with joins) using INSERT...SELECT
- Massively parallel DELETE and UPDATE
- Integration with other Postgres extensions, such as HLL, PostGIS, pg_partman
- Advanced index types, including GIN, GiST, partial indexes
- Table partitioning (for bulk delete, faster scans)
I would like to understand this better, for my edification. The best source on the exact semantics of pg replication that I've found were in this talk by Andres Freund: https://www.youtube.com/watch?v=VkzNL-uzBvA
What I understand is that "synchronous replication" means that the master commits locally, then waits for a slave (or k of them with quorum replication) to receive (and possibly also to apply) a commit before acknowledging the commit to the client. So what does that mean for "avoiding losing any data on node failure"? Commits acked to their client are indeed durable. But what about commits that are not acked? For example, is it possible for the master to apply a commit which is then seen by another query (from another client), then crash before the commit is sent to the slave? And so, if we then failover to the slave, the commit is gone (even though it had previously been observed)? This would qualify as "losing data".
Similarly, in that talk, Andres mentions that there is a window during which "split brain" is possible between an old master and a new master: when doing a failover, there's nothing preventing the old master from still accepting writes; so you need to tell all the clients to failover at the same time - which may be a tall order for the network partition scenarios when these things become interesting. If the old master is still accepting writes, then these writes diverge from the new master (so, again, lost data). With sync replication, I guess the window for these writes is limited since none of them will be acked to its client (but still, they're visible to other clients) - the window would be one write per connection.
I'm also curious to understand better how people in the Postgres world generally think about failoverd, both in vanilla pg and in Citus). I generally am able to find little information on best practices and the correctness of failovers, even though they're really half the story when talking about replication. For example, when replicating to multiple slaves (even with quorum replication), how does one choose what slave to failover to when a failover is needed? What kind of tool or system decides which one of the slaves has the full log?
My intuition says that this is a pretty fundamental different compared to a consensus-driven architecture like CockroachDB: in Cockroach the "failovers" are properly serialized with all the other writes, so this kind of question about whether one particular node has all the writes or not at failover time is moot.
A commit doesn't become visible until it is synchronously replicated, regardless of whether its ack fails or succeeds. So in the case you're describing the commit is never acked and never observed.
> there's nothing preventing the old master from still accepting writes; so you need to tell all the clients to failover at the same time
In Citus Cloud we detach the ENI to make sure no more writes are going to the old primary, and the attach it to the new primary.
Without such infrastructure, an alternative is to have a shutdown timer for primaries that are on the losing side of a network partition. The system can recover after the timeout.
If you're using a single coordinator, then this only applies to the coordinator. The workers can just fail over by updating the coordinator metadata.
> how does one choose what slave to failover to when a failover is needed?
You can pick the one with the highest LSN among a quorum, since it's guaranteed to have all acknowledged writes.