
Say No to Paxos Overhead: Replacing Consensus with Network Ordering - kushti
https://blog.acolyer.org/2016/12/08/just-say-no-to-paxos-overhead-replacing-consensus-with-network-ordering/
======
mritun
Interesting research paper but multicast is a non-starter for non-trivial
production systems.

1. As implemented by most (read: all) network hardware, it works fine when
congestion is low. As soon as congestion starts increasing, packet loss
amplifies.

2. If you have large multicast groups, the Bloom filters used for membership
tests become very full, so the false positive rate rapidly increases. This
leads to multicast storms.
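The Bloom-filter effect is easy to see with the standard false-positive approximation (1 - e^(-kn/m))^k. A quick sketch (the filter size and hash count below are made-up illustrative values, not taken from any particular switch):

```python
import math

def bloom_fp_rate(n_items, m_bits, k_hashes):
    """Classic approximation of a Bloom filter's false positive rate:
    (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# A fixed-size filter (as in switch hardware) with more and more
# multicast group members: the false positive rate climbs quickly.
for members in (100, 1000, 5000, 10000):
    print(members, bloom_fp_rate(members, m_bits=8192, k_hashes=4))
```

With the 8192-bit filter above, going from 100 members to 10000 takes the false positive rate from effectively zero to well over 90%, which is the "rapidly increases" behavior described.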

In my experience, anything which is not TCP (or, to lesser degree UDP) is a
non-starter for building reliable distributed systems. Despite the claims of
"lower overhead" the TCP protocol stack gets most love and hence has been
optimized heavily on every significant operating system.

~~~
jerf
But we're not implementing multicast in some general sense, then pushing it to
its limits as we try to scale it to thousands of devices. Isn't that the
multicast failure case? What we're doing here is replacing something that
would right now be using Paxos, which has a lot of message chatter and
therefore can become bandwidth limited probably sooner than the multicast
channel would be anyhow. You wouldn't use this to try to coordinate thousands
of machines either way, you'd be doing this to replace something that ran at
Paxos-scale previously. There's a bound here, and might it be below the
multicast failure bound?

(Honest question; I have no direct experience with multicast.)

~~~
mritun
Multicast fails before bandwidth starts becoming an issue. Collisions and
congestion are fact of life and do not necessarily mean that links are full
(you may just have lots of packets stuck close together in time).

Most TCP implementations handle it well, and Linux has well-written guides on
how to tune parameters when the defaults don't match your specific situation.
Multicast just fails catastrophically in arcane ways that depend on your
specific hardware, network topology, and the mix of firmware versions on your
switches/routers. That makes it very hard to "design" a product around such a
fragile system.

------
otoburb
Using the correct capitalization makes more sense for the HN title because
"NO" in this instance and context means "network ordering", which the article
then discusses (even though the expansion is at the end of the article title).

~~~
IshKebab
It's a pun. Although you are right, they have kind of ruined it by using No
instead of NO.

------
hardwaresofton
"NOPaxos" might be a catchy name, but it's probably not good in the long term,
as it might make people think it's NOT Paxos. Maybe "NeOPaxos" is better?

Also, if my understanding is anywhere near correct, it's a mix of Paxos and a
bit of a proxying sequencer (which may be implemented in a physical switch),
and all replicas & the master need to be on the same switch/subnet...

Going to read the paper now -- anyone have a better understanding of the tech
they could share? Excited but still a bit skeptical

~~~
qznc
"NoPaxos" is not "Not Paxos" like "NoSQL" is "Not SQL" ;)

If you use a NoSQL database, the relations do not disappear. You just have to
implement your joins in a higher layer. NoPaxos pushes some parts of Paxos
into lower levels, namely the OUM primitive of the network.

~~~
hardwaresofton
Forgive me if I'm wrong, but aren't they still pretty similar (in that they
sound like misnomers?)

Relations ~= Consensus (traditional full-featured Paxos)

To use your sentence

If you use NOPaxos, the consensus doesn't disappear. You just have to perform
it whenever your sequencer or replica-leader becomes unreachable/fails.

The "OUM primitive" addition isn't much more than two monotonically increasing
integers (session and order-number) which depend on a central (bottleneck)
sequencer... NoSQL moved the responsibility for relation-management up,
NoPaxos is moving the responsibility for consensus down/to the worst case...
My point was that neither are fundamentally different from the "X" after the
"NO"...
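Those two counters can be sketched in a few lines. A toy model (names are illustrative, and the real OUM sequencer lives in the network, not in application code):

```python
from dataclasses import dataclass

@dataclass
class StampedMessage:
    session: int   # incremented when a replacement sequencer takes over
    seqno: int     # monotonically increasing within a session
    payload: bytes

class Sequencer:
    """Toy model of the central sequencer: it only stamps messages
    with (session, seqno); it holds no application state."""
    def __init__(self, session=1):
        self.session = session
        self.next_seqno = 1

    def stamp(self, payload):
        msg = StampedMessage(self.session, self.next_seqno, payload)
        self.next_seqno += 1
        return msg

    def failover(self):
        # A new sequencer bumps the session number so replicas can
        # tell the two numbering streams apart.
        self.session += 1
        self.next_seqno = 1

seq = Sequencer()
a = seq.stamp(b"write-1")   # session 1, seqno 1
b = seq.stamp(b"write-2")   # session 1, seqno 2
seq.failover()
c = seq.stamp(b"write-3")   # session 2, seqno 1
```

The point stands either way: the stamping itself is trivial, and the hard part (consensus) only runs when this central component fails.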

------
toast0
I haven't done much with Paxos and friends and I only read the blog, not the
paper, but this seems like elect a coordinator, send all changes through the
coordinator which creates an ordering and sends out the ordered changes to the
group. And by the way, hide the coordinator in your switch.

This seems pretty straightforward -- running MySQL replication in a sane way
essentially means appointing a master server, sending all writes through it,
and hoping you don't need to hold a new election.

~~~
altendo
This is characteristic of all distributed systems, so in that sense it isn't a
unique concept to NOPaxos. The emphasis here is minimizing the traffic that a
typical Paxos implementation incurs while changing the constraints of the
distributed system.

------
siscia
In the second-to-last figure (Fig. 5) you can see behavior that is extremely
weird in my opinion: the latency is almost constant, then hits a wall and goes
up exponentially (I guess).

Is this common? What causes this behavior?

~~~
mritun
That's multicast for you. Once congestion and/or group size increase beyond a
certain value, things go /really/ bad; as in you may need to "shut down, wait
and reboot" your network to recover.

~~~
tssva
All of the tested protocols, including unreplicated flows, follow the same
pattern, with the proposed multicast protocol most closely tracking the
unreplicated data; therefore some intrinsic property of multicast is not
responsible.

------
toolslive
There are other consensus algorithms that start with a monotonic channel (i.e.
a channel that might drop messages, but does not reorder them). Mencius, for
example, does this and also avoids most of the overhead:
[http://www.sysnet.ucsd.edu/sysnet/miscpapers/mencius-
osdi.pdf](http://www.sysnet.ucsd.edu/sysnet/miscpapers/mencius-osdi.pdf)

------
matthewaveryusa
I'm actually really curious about this. If you are geo-diverse and need to go
over the internet, will this work?

~~~
hardwaresofton
It depends on what your definition of "work" is.

If you're geo-diverse and you want writes to happen at more than one place
(as in, the top nodes at each geographic location can all process requests),
then you have to go the traditional consensus route. One server in one region
makes a write, and the servers in the other regions need to acknowledge the
write before they move on, lest state become inconsistent. Pretty sure there's
no way around it.

If you're geo-diverse, but are OK with possibly stale reads from time to time
and with writing only to one geographic area (server), then it seems possible --
you'd just apply these same principles across a bigger scale rather than same-
datacenter. You'd probably also lose a lot of the latency gains, though.

------
socmag
This gets a huge thumbs up from me.

I've worked through a lot of the same reasoning as these people, and as
technically correct as PAXOS might be, in the real world, causality at high
frequency is a blur at best, and there is absolutely no need to impose the
hard constraints that PAXOS and others imply.

The only cases where guaranteed constraint systems are needed are when you
don't have traffic and at each time step all parties are capable of asking
each other to confirm that from all frames of reference they all agree who and
why someone committed first. Which is great, but makes zero sense in the real
world.

At scale transactions are and should be committed on a best effort basis.

You don't look at your bank account and wonder why the T-Mobile payment went
out slightly after your ATM withdrawal.

Why? Because... shrug. God said so. He works in mysterious ways.

This is awesome. Sign me up.

~~~
pron
You have misunderstood (the title is a pun). It is "Say NO to Paxos
Overhead", and it's about a new Paxos variant called NOPaxos (Network Ordering
Paxos) that relies on code running in network switches to provide ordering.
The same consensus guarantees apply -- in fact, it is still Paxos -- but the
latency is significantly improved (and is within 4% of no replication at all).
So it's Paxos with (almost) no overhead, thanks to NO.
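The ordering guarantee can be sketched from the receiver's side: the network may drop a stamped message, but it can't reorder them, so a replica detects a drop as a gap in the sequence numbers and only then falls back to recovery. A purely illustrative sketch (single-session simplification and names are mine, not from the paper):

```python
def deliver_in_order(stamped_msgs):
    """Illustrative receiver logic: messages carry (session, seqno)
    stamps. In-order messages are delivered; a gap in seqno means the
    network dropped a message, and the replica would have to run a
    recovery protocol (here we just record the missing range)."""
    expected = 1
    delivered, gaps = [], []
    for session, seqno, payload in stamped_msgs:
        if seqno == expected:
            delivered.append(payload)
            expected = seqno + 1
        elif seqno > expected:
            gaps.append((expected, seqno - 1))  # missing range
            delivered.append(payload)
            expected = seqno + 1
    return delivered, gaps

msgs = [(1, 1, "a"), (1, 2, "b"), (1, 4, "d")]  # seqno 3 was dropped
print(deliver_in_order(msgs))  # → (['a', 'b', 'd'], [(3, 3)])
```

This is why the expensive coordination only runs on the rare failure path, while the common case is just checking a counter.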

You are also very mistaken about there being "no need" for consistency
guarantees. Even systems that don't provide consistency for _every_ data
update do enjoy consistency for some rarer events (like cluster membership).
Without strong consistency _at all_, the range of distributed programs we can
build is much reduced. The need for a greater range of distributed programs
arises not from geographically separated applications (like ATMs), but from
the mere fact that distributed computing is currently our only way to scale
computational power (_within_ a data center). If we cannot have consistent
distributed computations, our computational power and the range of
applications we can build in general would be greatly reduced.

~~~
socmag
I was definitely NOt saying there is no need for consistency guarantees at
all. I was saying that this is a much better way to do it.

