
PigPaxos: Removing the Scalability Bottlenecks in Paxos - mad44
https://muratbuffalo.blogspot.com/2020/03/pigpaxos-devouring-communication_18.html
======
excerionsforte
Interesting idea: moving the reception of messages to other nodes and getting
back a smaller set of messages. Not only that, but distributing the reception
responsibility across relay groups where it can be load balanced among peers.

The leader using relay nodes reminds me of how humans organize into boss and
worker groups and the boss shouts out orders to group leaders.

I would like to see how dynamic relay groups perform under stress, and I
wonder who would communicate new relay group assignments: the leader imposing
the relay group structure, or the groups self-organizing.
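
Roughly the shape I have in mind, as a toy sketch (the group sizes and the
ack probability are made up, not from the paper): the leader talks to one
relay per group, and each relay fans out to its peers and sends back a single
aggregated ack count instead of one ack per peer.

    package main

    import (
        "fmt"
        "math/rand"
    )

    // One relay group: the relay forwards the leader's message to its peers
    // and returns a single aggregated ack count instead of one ack per peer.
    func relayRound(group []string) int {
        acks := 0
        for range group {
            if rand.Float64() < 0.95 { // pretend a peer occasionally misses the message
                acks++
            }
        }
        return acks
    }

    func main() {
        groups := [][]string{
            {"n1", "n2", "n3"},
            {"n4", "n5", "n6"},
            {"n7", "n8", "n9"},
        }
        total := 1 // the leader counts its own vote
        for _, g := range groups {
            total += relayRound(g) // the leader only talks to one relay per group
        }
        quorum := (1+9)/2 + 1 // 10 nodes total -> need 6
        fmt.Printf("acks=%d quorum=%d committed=%v\n", total, quorum, total >= quorum)
    }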

~~~
hinkley
I use Raft for the same reason I use Ethernet. As my distributed computing
teacher said, it’s a terrible protocol but the best one we have.

Raft, and I believe Paxos, contain some of the 8 Fallacies of Distributed
Computing. The most unavoidable being assuming the network is heterogenous. I
won’t say this keeps me up at night, but it distracts me when I’m eating
lunch.

Mere mortals don’t have all identical hardware or all equidistant network ping
times. And even if you start that way, upgrades will at least temporarily
change that dynamic. Geographical redundancy will _really_ change that
dynamic. And often it’s when we are changing things that they break. So it’s
all fine until you get new customers and have to upgrade, and then all of
your new customers get to watch you crash and burn during the upgrade.

Rabbit and Consul reduce this surface area by having two classes of nodes.
I’ve heard of Raft variants with three (pure consumers, full voting members,
and new or returning members who are still catching up) so that you can use
machines unprepared for leadership and they will never try to elect themselves
leader.

Right now we are still in a weird era where networking is so stupid fast that
it’s faster to get data from another machine than from your own storage, but
that can’t last forever. Berkeley did some really crazy things when this
inversion of costs happened in the ’80s, but by the time I learned about it
less than 10 years later, we were already tut-tutting about how foolish that
was.

I think the big thing I’ve been waiting for from a relay group type solution
is for the participants to discover the network topology and elect a local
representative to receive and rebroadcast the inbound stream of updates.

I don’t know that you have to funnel all traffic through that intermediary,
the way statsd does. Just the cumulative event stream could represent an order
of magnitude decrease in packets with only a slight increase in latency.
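
Something like this toy sketch is what I'm picturing (the zone labels and the
lowest-name tie-break are just placeholders; real zone membership would come
from measured ping times): group peers by locality, pick one representative
per zone, and only that representative subscribes to the full upstream stream
and rebroadcasts it to its neighbors.

    package main

    import (
        "fmt"
        "sort"
    )

    // Pick one representative per zone; only the representative subscribes to
    // the upstream update stream and rebroadcasts it to its local peers.
    func electRepresentatives(zoneOf map[string]string) map[string]string {
        reps := map[string]string{}
        for node, zone := range zoneOf {
            if cur, ok := reps[zone]; !ok || node < cur { // lowest name wins the tie-break
                reps[zone] = node
            }
        }
        return reps
    }

    func main() {
        zoneOf := map[string]string{
            "a1": "us-east", "a2": "us-east",
            "b1": "eu-west", "b2": "eu-west", "b3": "eu-west",
        }
        reps := electRepresentatives(zoneOf)
        zones := make([]string, 0, len(reps))
        for z := range reps {
            zones = append(zones, z)
        }
        sort.Strings(zones)
        for _, z := range zones {
            fmt.Printf("zone %s: %s rebroadcasts the inbound stream\n", z, reps[z])
        }
    }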

~~~
dnautics
> Raft, and I believe Paxos, contain some of the 8 Fallacies of Distributed
> Computing. The most unavoidable being assuming the network is heterogenous.

Did you mean homogeneous?

Also, Raft doesn't assume the network is homogeneous.

> I’ve heard of Raft variants with three (pure consumers, full voting members,
> and new or returning members who are still catching up) so that you can use
> machines unprepared for leadership and they will never try to elect
> themselves leader.

What's the purpose of these non-leader-competent nodes? Is it merely to
increase the quorum membership and make consistency more difficult? Why
wouldn't you architect this as a client/server relationship?

~~~
grogers
Regarding the other types of nodes, pure consumers are useful as "standby"
nodes. They are ready to be swapped in when another host fails, but don't
participate in voting until they are swapped in. Swapping them in still
requires a quorum of the previous nodes, so they are only useful in small
numbers to be swapped in when another node fails. Nodes that are catching up
are just a special case of this pure consumer type that shouldn't be promoted
to be a full voting member until they are fully caught up.
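
A toy sketch of what I mean (the field names are mine, and in a real system a
promotion would itself go through consensus as a config change): only voting
members count toward the quorum, and a standby gets promoted only once it has
caught up.

    package main

    import "fmt"

    type Node struct {
        Name    string
        Voter   bool
        Applied int // highest log index this node has applied
    }

    // Only voting members count toward the quorum, so standbys and
    // catching-up nodes don't grow the quorum or slow commits down.
    func quorumSize(nodes []Node) int {
        voters := 0
        for _, n := range nodes {
            if n.Voter {
                voters++
            }
        }
        return voters/2 + 1
    }

    // promote swaps a standby in as a voter only once it has caught up.
    func promote(n *Node, leaderApplied int) bool {
        if n.Applied >= leaderApplied {
            n.Voter = true
        }
        return n.Voter
    }

    func main() {
        cluster := []Node{
            {"v1", true, 100}, {"v2", true, 100}, {"v3", true, 100},
            {"standby", false, 80},
        }
        fmt.Println("quorum with 3 voters:", quorumSize(cluster)) // 2
        promote(&cluster[3], 100)                                 // still behind, stays a standby
        cluster[3].Applied = 100
        promote(&cluster[3], 100) // caught up, becomes a voter
        fmt.Println("quorum with 4 voters:", quorumSize(cluster)) // 3
    }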

I'll add another type of node that sometimes comes in handy in certain network
topologies. I've typically heard this called a "witness" - it is a voting
member but doesn't store any actual data (just metadata) so it can't be a
leader. The typical use for this is if you have a data center cluster with
only two DCs.

In that scenario you can make the number of nodes in each DC unbalanced so if
the minority DC goes down you still have a quorum. But you can't make it so
that if either DC goes down you can still have a quorum. So you add a third DC
over a WAN that acts as an arbitrator between the two. Because it's far away,
you don't want it to store all the data - keeping it in sync would be
expensive. You only need it to know enough metadata to vote correctly.

This obviously changes the failure characteristics quite a bit vs having 3
DCs in the cluster like normal, such as how many real nodes and witness nodes
you can tolerate failing. Latency while one of your main DCs is down is much
higher than normal (because you now need acks from the witnesses) but you at
least keep running.
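
As a toy count of the 2-DC-plus-witness arrangement (the node counts here are
made up): two data nodes per DC plus one witness gives 5 voters with a
majority of 3, so losing either data DC still leaves a quorum.

    package main

    import "fmt"

    type Member struct {
        Name    string
        DC      string
        Witness bool // votes on metadata only, stores no user data
    }

    // haveQuorum checks whether a majority of voters is still reachable
    // after the given data centers go down.
    func haveQuorum(members []Member, downDC map[string]bool) bool {
        up := 0
        for _, m := range members {
            if !downDC[m.DC] {
                up++
            }
        }
        return up > len(members)/2
    }

    func main() {
        // 2 data nodes per DC plus a witness over the WAN: 5 voters, majority 3.
        // Without the witness, 2+2 could not survive losing either DC.
        cluster := []Member{
            {"a1", "dc1", false}, {"a2", "dc1", false},
            {"b1", "dc2", false}, {"b2", "dc2", false},
            {"w1", "dc3", true},
        }
        fmt.Println("dc1 down:", haveQuorum(cluster, map[string]bool{"dc1": true})) // true
        fmt.Println("dc2 down:", haveQuorum(cluster, map[string]bool{"dc2": true})) // true
    }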

~~~
dnautics
Do these alternative topologies still have the proven guarantees that raft
has?

------
eternalban
[edit: comparative performance to EPaxos is noted in OP - a 10x improvement is
claimed.]

Egalitarian Paxos (2013): [https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf](https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf)

code:
[https://github.com/efficient/epaxos](https://github.com/efficient/epaxos)

------
nano_o
Great idea and great work!

A couple of nitpicks: it would be nice to see what happens when the leader
fails. Optimizing for the case of a stable leader might have an impact on
recovery time.

Another important aspect for fault-tolerance is whether you can really survive
any minority crashing. For example, if only the strictly necessary number of
nodes keep up with the leader, then if most of those crash the system will
have a really hard time recovering due to the backlog accumulated at slow
nodes which now need to catch up for the system to continue operating.

A performance number that does not take those things into account may not be
very realistic. Nevertheless the idea is pretty good.

~~~
tptacek
Doesn't Multi-Paxos already have stable leaders? My understanding was that the
innovation here was to relay prepare/promise/accept/accepted across a random
relay network.

~~~
nano_o
Yes, it's a nitpick. The comparison to Multi-Paxos seems fair because it makes
similar assumptions (unless re-configuring the relay network after a leader
failure is somehow difficult, but I wouldn't expect that).

My point is that it would be nice to benchmark protocols that take into
account the issues I brought up, and measure what happens in the worst failure
scenarios they are supposed to tolerate. Otherwise we get a false sense of
what performance can be achieved if one really cares about fault-tolerance.

This small issue does not diminish the main contribution of the paper in any
way.

------
hadronzoo
Is this a similar concept to compartmentalized consensus?
[https://mwhittaker.github.io/publications/compartmentalized_consensus.pdf](https://mwhittaker.github.io/publications/compartmentalized_consensus.pdf)

~~~
mad44
Compartmentalized consensus separates the follower role into acceptor and
replica, and divorces command-log replication from data replication. It also
seems to use two acceptor groups for horizontal scaling.

Compartmentalized consensus does not have relay nodes that do
relay/aggregation. The idea in PigPaxos is simply that randomized
relay/aggregators have surprising power for vertically scaling Paxos.

BipartisanPaxos seems to apply the compartmentalized consensus idea to
EPaxos.

------
zinclozenge
Interesting that there is another Paxos variant that was published to arxiv
around february called BipartisanPaxos
[https://mwhittaker.github.io/publications/bipartisan_paxos.pdf](https://mwhittaker.github.io/publications/bipartisan_paxos.pdf).
It too aims at removing bottlenecks.

~~~
mad44
Compartmentalized consensus separates the follower role into acceptor and
replica, and divorces command-log replication from data replication. It also
seems to use two acceptor groups for horizontal scaling.

Compartmentalized consensus does not have relay nodes that do
relay/aggregation. The idea in PigPaxos is simply that randomized
relay/aggregators have surprising power for vertically scaling Paxos.

BipartisanPaxos seems to apply the compartmentalized consensus idea to
EPaxos.

------
senderista
You can achieve constant write throughput and read latency, and linearly
scalable read throughput (at the cost of write latency), with LCR, a
criminally unknown uniform total order broadcast protocol.
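
Very roughly, and this is only the ring-forwarding shape rather than the
actual LCR protocol: a write circulates a fixed ring and is safe to deliver
once it has made a full lap, so the per-write cost stays constant while reads
can be served locally by any replica.

    package main

    import "fmt"

    // Toy ring pass: a write originating at ring[src] is forwarded hop by hop;
    // after a full lap every replica has seen it, so it can be delivered in
    // ring order. This is just the topology, not the real LCR protocol.
    func ringBroadcast(ring []string, src int, msg string) {
        n := len(ring)
        for hop := 1; hop <= n; hop++ {
            from := ring[(src+hop-1)%n]
            to := ring[(src+hop)%n]
            fmt.Printf("%s -> %s: %s\n", from, to, msg)
        }
        fmt.Printf("%s delivered everywhere after %d hops\n", msg, n)
    }

    func main() {
        ring := []string{"r0", "r1", "r2", "r3"}
        ringBroadcast(ring, 0, "write-1")
        // Reads can be answered locally by any replica once delivery happens,
        // which is where the linear read scaling comes from.
    }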

