
Clustering by Consensus - nickpresta
http://aosabook.org/en/500L/clustering-by-consensus.html
======
keypusher
I like a lot of this, and it is particularly interesting to me as I am
currently looking for something to replace a multi-node Corosync/Pacemaker
clustered application. I have mixed feelings about writing the solution in
Python: while I love the language, I worry that it is not safe enough for a
rock-solid, never-fail clustering implementation.

However, my biggest problem is that the author seems to completely punt on the
most important problem. "To avoid this outcome (split brain), creating a new
cluster is a user-specified operation." That's... just not good enough. Any
clustering stack worth running must be capable of handling a network partition
and continuing operation in a safe manner. In a traditional cluster, that means
the excluded nodes are fenced and the remaining majority continues. There are
other valid solutions, but requiring user intervention in the event of a
network split would be a complete non-starter for the type of application we
run. We ship hundreds of these systems to customers, and they must be able to
run without user intervention at all times. If something fails, it might be
days before we can get a tech out to investigate the problem. Not being able
to handle this situation almost seems to miss the entire point of implementing
a good consensus algorithm in the first place.
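
To be concrete about what I mean by "the remaining majority continues", here is
a rough sketch of the quorum rule in Python. The names has_quorum and
partition_action are made up for illustration, cluster_size is assumed to be
the configured membership count, and actual fencing (STONITH and the like) is
left out entirely. This is not how the article's code or Corosync does it.

```python
def has_quorum(reachable_nodes, cluster_size):
    """Return True if this side of a partition holds a strict majority.

    reachable_nodes: set of node names this node can currently talk to,
                     including itself (hypothetical input).
    cluster_size:    total number of configured cluster members.
    """
    return len(reachable_nodes) > cluster_size // 2


def partition_action(reachable_nodes, cluster_size):
    # The majority side keeps serving. The minority side must stop on its
    # own (or be fenced by the majority) so that two halves never both act
    # as the cluster.
    if has_quorum(reachable_nodes, cluster_size):
        return "continue"
    return "stop"


# A 5-node cluster split 3/2: only the 3-node side keeps running.
print(partition_action({"a", "b", "c"}, 5))
print(partition_action({"d", "e"}, 5))
```

Note that an even split (2 of 4) leaves neither side with a majority, so both
stop. That is why odd node counts or a tie-breaker are the usual advice.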

