
A More Flexible Paxos - dctrwatson
http://ssougou.blogspot.com/2016/08/a-more-flexible-paxos.html
======
mjb
See Lamport's "Cheap Paxos" and "Vertical Paxos" papers for a treatment of
what seems like very similar ideas. Vertical Paxos describes using a
'configuration master' (or, possibly, a larger quorum on a larger jury) to
make changes to a smaller jury. The details are a bit subtle, but the PlusCal
code in the paper lays things out very clearly.

Also check out Peleg and Wool, "The Availability of Quorum Systems"
(it's a bit old-fashioned, but still very interesting) for a thorough
discussion of the availability argument (separate from Paxos, but in the
context of quorum systems in general).

------
sougou
I'm the author of the blog. Thanks for the posts and interest shown.

This proposal is a generalization of the core Paxos algorithm. It's therefore
orthogonal to all customizations and adaptations. I believe that they can all
benefit from it, including RAFT.

If you read the thread on raft-dev, you'll see that Heidi Howard has
independently come up with the same idea, and is working on a proof for it.

------
teraflop
There was some discussion of this post on the raft-dev mailing list the other
day: https://groups.google.com/forum/?fromgroups=#!topic/raft-dev/wANZPzSz6QE

------
ahachete
I'm no Paxos expert either, but I see some statements in the post that are
probably not correct, or not totally precise:

\- That 7 or more nodes exhibit degraded performance, and that 5 is the sweet
spot, seems like a hand-wavy statement. I'd like to see numbers. And I believe
those will be related more to implementation details than to algorithmic
limitations. Sure, with more nodes, communication overhead grows. But only a
majority needs to agree, so there's no need to wait for the slowest node. Sure,
there's a sweet spot, but I wouldn't say it's exactly 5.
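For concreteness, the trade-off being questioned above can be sketched in a few lines: a classic Paxos majority quorum has size floor(n/2) + 1, so each additional pair of nodes tolerates one more failure but also enlarges the quorum every round must reach. (This is just the standard majority arithmetic, not anything from the post.)

```python
# Majority-quorum size and tolerated failures for classic Paxos.
# Adding a pair of nodes buys one more tolerated failure, but also
# grows the quorum that every round of the protocol must contact.
def majority(n):
    """Smallest quorum size that guarantees any two quorums overlap."""
    return n // 2 + 1

def tolerated_failures(n):
    """Failures the cluster survives while still forming a majority."""
    return (n - 1) // 2

for n in (3, 5, 7, 9, 11):
    print(f"n={n}: quorum={majority(n)}, tolerates={tolerated_failures(n)}")
```

Nothing in this arithmetic singles out 5 as a sweet spot; any preference for 5 comes from latency and messaging overhead in real deployments, not from the quorum math itself.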

\- Paxos _does not_ perform leader election. There is, indeed, no concept of a
leader. There are only proposers, acceptors and learners. There's a leader
concept in Multi-Paxos, but that's a different story, and it's just a
performance improvement, not a requirement.

\- The original Paxos paper may be harder to understand, but "Paxos made
Simple" is not. Probably easier to understand (IMVHO) than Raft's paper.

\- The post assumes, as most Paxos implementations do, that the three Paxos
roles (proposer, acceptor and learner) are collapsed into a single node. They
could be on different nodes. This may challenge some of the assumptions made
about the "regular" Paxos algorithm.

\- What the post proposes could probably be easily simulated with a varying
number of proposers, acceptors and learners (for instance, having a lower
number of acceptors than learners). This should yield the same benefits as the
modified algorithm... but without modifying the algorithm. This is important,
since the original algorithm is proven mathematically.

------
dsp1234
Just like the author, I'm no Paxos expert.

However, I see a clear issue with the proposed model.

Take the author's 11-node setup: 9 nodes are required for leader election, and
3 nodes are needed for proposals.

Assume that a network partition splits the nodes with 5 nodes on side A and 6
nodes on side B. Assume also that the leader is on side B (though it works
without loss of generality for side A as well).

At this point, no leader re-election can occur, since no side has the 9 total
nodes required for a re-election. It's obvious that if the leader dies (or
died as part of the partition) that no leader can be re-elected.

With the partition in place:

1.) let 1 node in side A make the proposal "X=3"

2.) Let 2 other nodes in side A agree to the proposal that "X=3"

3.) Let 1 node in the side B make the proposal "X=0"

4.) Let 2 other nodes in side B agree to the proposal that "X=0"

5.) Let the partition heal

6.) The cluster is now split brain on whether "X=0" or "X=3". According to the
rule provided, the leader must accept any successful proposal.

Note that this split brain does not even require a network partition. Assume
no network partition, and nodes are labeled 0-10. The leader is currently node
10.

1.) Let node 1 propose that "X=4"

2.) Let node 2 propose that "X=5"

3.) Let nodes 3 and 9 agree to the proposal that "X=4"

4.) Let nodes 6 and 8 agree to the proposal that "X=5"

5.) Split brain occurs as the leader has to accept any successful proposal,
which includes "X=4" from nodes 1, 3 and 9, and also "X=5" from nodes 2, 6 and
8.
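The quorum arithmetic behind the scenarios above can be checked mechanically. By the pigeonhole principle, two quorums are guaranteed to overlap exactly when their sizes sum to more than the cluster size. A quick sketch, using the 11-node / election-quorum-9 / proposal-quorum-3 numbers from the scenario:

```python
# Quorum intersection arithmetic for the 11-node scenario above.
# Two quorums of sizes q_a and q_b in an n-node cluster are guaranteed
# to share a node iff q_a + q_b > n (pigeonhole principle).
def must_intersect(q_a, q_b, n):
    return q_a + q_b > n

n, election_q, proposal_q = 11, 9, 3

# Any election quorum overlaps any proposal quorum: 9 + 3 > 11.
print(must_intersect(election_q, proposal_q, n))   # True

# But two proposal quorums need not overlap: 3 + 3 <= 11.
# E.g. {1, 3, 9} and {2, 6, 8} above are disjoint.
print(must_intersect(proposal_q, proposal_q, n))   # False
```

So two proposal quorums can indeed be disjoint, as the examples above construct; whether that is actually unsafe depends on which quorums the algorithm requires to intersect, which is what the replies below turn on.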

~~~
oggy
The split-brain scenarios you mention cannot occur, since all the proposals
have to go through the leader before they are agreed to by any processes.

Moreover, the leaders are lease-based in the author's proposal, so I believe
that what he describes correctly solves consensus. However, his proposal
doesn't really give much of a resilience improvement over the original Paxos,
since the system, in the worst case, tolerates only 2 failures (leader + one
more process), same as a 5-process Paxos. Nevertheless, an interesting idea,
one that I don't think I've seen elsewhere.
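The resilience figure quoted above can be worked out directly: once more nodes have failed than the slack beyond the largest quorum, that quorum can no longer be formed. A sketch of that arithmetic, using the post's 11-node example (election quorum 9, proposal quorum 3) against 5-node classic Paxos:

```python
# Worst-case failure tolerance is bounded by the largest quorum the
# system ever needs: once more than n - max(q1, q2) nodes are down,
# that quorum can no longer be assembled.
def worst_case_tolerated(n, q_election, q_proposal):
    return n - max(q_election, q_proposal)

print(worst_case_tolerated(11, 9, 3))  # 2
print(worst_case_tolerated(5, 3, 3))   # 2, same as 5-node classic Paxos
```

This is why the 11-node configuration, in the worst case, survives no more failures than a 5-node majority-quorum deployment, despite having more than twice the hardware.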

~~~
maccilia
So I'm pretty sure a similar problem could exist. If, to accept a proposal, we
only need 3 nodes, not a majority, and network latency is manipulated in
arbitrarily bad ways, a similar split brain could occur. Basically, you could
commit the same transaction twice under different proposal numbers.

~~~
oggy
Consensus is not about transactions, it's about agreeing on a value (a single
value, strictly speaking, unlike atomic broadcast, which is about agreeing on
a sequence of values, although people often identify the two). But anyway,
your argument is not in any way concrete - try to come up with a
counterexample. Having thought about it more, I am confident that the author's
system is correct (in the sense that it always preserves agreement).

------
fizx
There's no theoretical reason why Multi-Paxos can't be implemented
hierarchically in order to increase throughput or expand the key space beyond
what a small number of machines can handle. But Paxos itself is complicated
enough to deter almost all implementers, so adding two additional levels of
complexity? I'm not surprised no one has tried.

Paxos: Agree on a value.

Multipaxos: Agree on a Leader, who will coordinate a set of values.

Hierarchical Multipaxos: Agree on a Leader, who will coordinate a set of
Leaders, who will each coordinate a set of values.

If you squint hard enough, this looks almost like a B-Tree.

~~~
im_down_w_otp
riak_ensemble is a hierarchical multi-paxos shaped system. It's what Riak
Strong Consistency buckets are built on, but it's also a stand-alone
consumable framework by itself.

~~~
fizx
Nifty! Did not know that.

------
bogomipz
"With systems getting more and more distributed, the Paxos algorithm has been
gaining popularity."

Has it though? I feel like many of the new distributed systems I read about
are Raft based consensus.

~~~
dragonwriter
> Has it though?

AFAICT, Yes.

> I feel like many of the new distributed systems I read about are Raft based
> consensus.

Sure, and Raft has probably been gaining popularity even more than Paxos has.
With more and more distributed systems and more attention on guarantees for
such systems, it's possible for both Paxos _and_ Raft (and perhaps other
techniques) to be getting more popular.

