I like a lot of this, and it is particularly interesting to me because I am currently looking for something to replace a multi-node Corosync/Pacemaker clustered application. I have mixed feelings about writing the solution in Python: while I love the language, I worry that it is not safe enough for a rock-solid, never-fail clustering implementation.
However, my biggest problem is that the author seems to completely punt on the most important problem. "To avoid this outcome (split brain), creating a new cluster is a user-specified operation." That's... just not good enough. Any clustering stack worth running must be capable of handling a network partition and continuing operation in a safe manner. In a traditional cluster that means the excluded nodes are fenced and the remaining majority continues. There are other valid solutions, but requiring user intervention in the event of a network split would be a complete non-starter for the type of application we run. We ship hundreds of these systems to customers, and they must be able to run without user intervention at all times. If something fails it might be days before we can get a tech out to investigate the problem. Not being able to handle this situation almost seems to miss the entire point of implementing a good consensus algorithm in the first place.
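To be concrete about what I mean by "the remaining majority continues", here is a minimal sketch of the majority rule. The function names are hypothetical and this is not how the author's project or Corosync/Pacemaker implements it; it only shows the decision each side of a partition has to make on its own, assuming every node knows the configured cluster size and counts itself among the nodes it can reach.

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """Return True if this partition holds a strict majority of the cluster.

    `reachable` counts the local node plus every peer it can still contact.
    """
    return reachable > cluster_size // 2


def partition_action(reachable: int, cluster_size: int) -> str:
    """Decide what a node should do after a membership change.

    The majority side keeps running (and would fence the nodes it lost);
    the minority side stops its resources rather than risk split brain.
    """
    return "continue" if has_quorum(reachable, cluster_size) else "stop"


# A 5-node cluster splits 3/2: only the 3-node side keeps running.
print(partition_action(3, 5))
print(partition_action(2, 5))
```

Note that an even split (2 of 4, say) leaves neither side with a majority, which is why real deployments add a tie-breaker or use an odd node count.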