I would like to understand this better, for my edification. The best source on the exact semantics of pg replication that I've found were in this talk by Andres Freund: https://www.youtube.com/watch?v=VkzNL-uzBvA
What I understand is that "synchronous replication" means that the master commits locally, then waits for a slave (or k of them with quorum replication) to receive (and possibly also to apply) a commit before acknowledging the commit to the client. So what does that mean for "avoiding losing any data on node failure"? Commits acked to their client are indeed durable. But what about commits that are not acked? For example, is it possible for the master to apply a commit which is then seen by another query (from another client), then crash before the commit is sent to the slave? And so, if we then failover to the slave, the commit is gone (even though it had previously been observed)? This would qualify as "losing data".
Similarly, in that talk, Andres mentions that there is a window during which "split brain" is possible between an old master and a new master: when doing a failover, there's nothing preventing the old master from still accepting writes; so you need to tell all the clients to failover at the same time - which may be a tall order for the network partition scenarios when these things become interesting. If the old master is still accepting writes, then these writes diverge from the new master (so, again, lost data). With sync replication, I guess the window for these writes is limited since none of them will be acked to its client (but still, they're visible to other clients) - the window would be one write per connection.
I'm also curious to understand better how people in the Postgres world generally think about failoverd, both in vanilla pg and in Citus). I generally am able to find little information on best practices and the correctness of failovers, even though they're really half the story when talking about replication. For example, when replicating to multiple slaves (even with quorum replication), how does one choose what slave to failover to when a failover is needed? What kind of tool or system decides which one of the slaves has the full log?
My intuition says that this is a pretty fundamental different compared to a consensus-driven architecture like CockroachDB: in Cockroach the "failovers" are properly serialized with all the other writes, so this kind of question about whether one particular node has all the writes or not at failover time is moot.
A commit doesn't become visible until it is synchronously replicated, regardless of whether its ack fails or succeeds. So in the case you're describing the commit is never acked and never observed.
> there's nothing preventing the old master from still accepting writes; so you need to tell all the clients to failover at the same time
In Citus Cloud we detach the ENI to make sure no more writes are going to the old primary, and the attach it to the new primary.
Without such infrastructure, an alternative is to have a shutdown timer for primaries that are on the losing side of a network partition. The system can recover after the timeout.
If you're using a single coordinator, then this only applies to the coordinator. The workers can just fail over by updating the coordinator metadata.
> how does one choose what slave to failover to when a failover is needed?
You can pick the one with the highest LSN among a quorum, since it's guaranteed to have all acknowledged writes.