
WPaxos: a wide area network Paxos protocol - ingve
https://muratbuffalo.blogspot.com/2017/12/wpaxos-wide-area-network-paxos-protocol.html
======
itcmcgrath
Need some non-holiday time to read this fully, but it's worth noting that
aside from Spanner, Cloud Datastore (using Megastore tech) and Cloud Firestore
(using the same tech as Cloud Spanner) all do Paxos across several data
centers over a wide area and achieve millions of non-batched writes/second.

Sometimes it can be hard to tease out the true trade-offs in these systems, as
they don't get as detailed an acknowledgement, but from a quick read it looks
like tail latency will be worse.

Improvements in consensus algos often have issues, and correct implementation
is hard, but they can make previously impossible things tenable, which I find
exciting.

~~~
csears
The fact that Google is able to achieve good write performance with standard
Paxos over wide areas is a direct result of their advanced network
infrastructure and low latency data center interconnects. I imagine there are
lots of people without a Google-class network who would benefit from WPaxos or
something similar.

~~~
itcmcgrath
Our network definitely makes a big difference, but even we don't just use
straight up vanilla Paxos.

------
ccleve
It looks like the core of the idea is "object stealing". In most consensus
algorithms you'll have a single leader, worldwide, that is in charge of
sequencing all objects. Either that, or you partition objects across leaders
and then coordinate across partitions.

In this algorithm you have multiple leaders, where a leader can be in charge
of a particular object temporarily by "stealing" it. This brings control to
the datacenter that needs it, reducing latency.
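To make the "stealing" idea concrete, here is a minimal single-process sketch in Python. It is a toy, not the WPaxos protocol: there is no networking, no quorum, and no failure handling, and the names (`ToyCluster`, `steal`, `write`, the zone strings) are made up for illustration. The use of a ballot number that increases on every ownership change is an assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectState:
    """Per-object record: current leader zone, ballot, and the object's own log."""
    owner: str
    ballot: int = 0
    log: list = field(default_factory=list)


class ToyCluster:
    """Single-process stand-in for several zones (hypothetical, no real consensus)."""

    def __init__(self):
        self.objects = {}

    def steal(self, key, zone):
        """Make `zone` the leader of `key`, bumping the ballot if ownership changes."""
        state = self.objects.get(key)
        if state is None:
            self.objects[key] = ObjectState(owner=zone)
        elif state.owner != zone:
            state.ballot += 1
            state.owner = zone
        return self.objects[key].ballot

    def write(self, key, value, zone):
        """Append to the object's log.

        Returns True if `zone` already led the object (local fast path),
        False if it had to steal the object first.
        """
        state = self.objects.get(key)
        local = state is not None and state.owner == zone
        if not local:
            self.steal(key, zone)
        state = self.objects[key]
        state.log.append((state.ballot, zone, value))
        return local


cluster = ToyCluster()
first = cluster.write("cart:42", "add apple", zone="us-east")   # no leader yet
second = cluster.write("cart:42", "add pear", zone="us-east")   # served locally
third = cluster.write("cart:42", "checkout", zone="eu-west")    # eu-west steals
print(first, second, third, cluster.objects["cart:42"].owner)
```

The point of the sketch is only the shape of the trade-off: repeated writes from the same zone stay local, while a write from another zone pays for an ownership change first.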

It's an interesting idea, and I haven't fully absorbed the paper. But I'll
comment on it anyway. It's always easier to criticize when you're ignorant of
the details :)

In general, I don't like these algorithms that only guarantee a partial
ordering. In other words, instead of having a single global total ordering of
all transactions, you have an ordering of operations that apply to a single
object or key or a related group of keys. This means you can't know if object
A got updated before object B. You can't know the total state of the system as
of a snapshot in time, or as of the moment a node membership change happens.
(Maybe. There might be workarounds for these issues, at the cost of some
complexity.)

This paper is a good contribution, but we ain't there yet.

~~~
matthelb
I also haven't read through the technical report in detail, but in skimming
through parts of it, it looks like WPaxos is supposed to provide
linearizability. More specifically, the authors state that:

> WPaxos maintains separate logs for every object and provides per-object
> linearizability.

Linearizability is a composable correctness condition, so per-object
linearizability implies that the composition of all the objects is
linearizable. Thus, WPaxos does in fact provide a single total ordering of all
operations. Nearly all of the replicated state machines via consensus
protocols (Paxos, EPaxos, Mencius, Generalized Paxos, etc.) provide this same
consistency level.

~~~
wickawic
Time to let my ignorance shine:

> Linearizability is a composable correctness condition

So if all I know is that A1 precedes A2 and that B1 precedes B2, how do I make
assertions about the ordering of A2 and B2? Is “linear” in the context that
you are talking about more specific than “orderable”?

GP isn’t complaining that the algorithm is incorrect, just that it doesn’t
have the property of being replayable or reversible in a deterministic way.
Most applications don’t need this property; I am just trying to see if I’m
missing a part of your argument.

~~~
matthelb
Maurice Herlihy and Jeannette Wing introduced linearizability back in 1990 [1].
It stipulates that the result of a concurrent execution of the system is
equivalent to a result from a sequential execution where the order of the
sequential execution respects the real-time order of operations in the
concurrent execution.

In this context, you could probably use "linear" and "orderable" to mean the
same thing. The idea is that, for any linearizable execution, there exists an
equivalent total order of the operations. Because there exists a total order,
if a client C1 sees A2 before B2 then a client C2 cannot see B2 before A2.

[1]
[https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf](https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf)
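The definition above can be turned into a small brute-force check. The sketch below is illustrative only: it models each object as a simple read/write register, represents each operation by its invocation and response times, and searches all permutations for a total order that respects real-time precedence and is legal sequentially. The `Op` type, the function names, and the example histories are all made up for this illustration, and the search is exponential, so it is only usable on tiny histories.

```python
from itertools import permutations
from typing import NamedTuple, Optional


class Op(NamedTuple):
    obj: str               # object (register) name
    kind: str              # "write" or "read"
    value: Optional[int]   # value written, or value the read returned
    start: float           # invocation time
    end: float             # response time


def respects_real_time(order):
    """If an op finished before another began, it must come first in the order."""
    for i, op in enumerate(order):
        for placed_before in order[:i]:
            if op.end < placed_before.start:
                return False
    return True


def legal_sequential(order):
    """Each read must return the latest value written to its object."""
    state = {}
    for op in order:
        if op.kind == "write":
            state[op.obj] = op.value
        elif state.get(op.obj) != op.value:
            return False
    return True


def find_linearization(history):
    """Return one valid total order, or None if the history is not linearizable."""
    for order in permutations(history):
        if respects_real_time(order) and legal_sequential(order):
            return order
    return None


a1 = Op("A", "write", 1, 0, 1)
a2 = Op("A", "write", 2, 2, 3)
b1 = Op("B", "write", 1, 0, 1)
b2 = Op("B", "write", 2, 2.5, 3.5)   # overlaps a2 in real time
good = [a1, a2, b1, b2, Op("A", "read", 2, 4, 5), Op("B", "read", 2, 4, 5)]
bad = [a1, a2, Op("A", "read", 1, 4, 5)]  # stale read after a2 completed

order = find_linearization(good)
print(order is not None, find_linearization(bad) is None)
```

Note that `a2` and `b2` overlap in real time, so under this definition a valid total order may place either one first. The check establishes that some total order exists; it does not pick out a unique one.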

~~~
pests
You still haven't answered the question though.

The definition of linearizability is not in question.

The question is whether this specific system has global linearizability, or
only separate per-object (or some other bucketing method) linearizability.

~~~
matthelb
Sorry, I was trying to give the GP more context for my initial comment.

As I said in my initial comment, the authors state that WPaxos provides
per-object linearizability, which implies that it provides (to use your term,
"global") linearizability.

~~~
itcmcgrath
It doesn't; the authors are calling this out to limit the scope of
linearizability to per-object. You're thinking about strict serializability.

The authors mention that transactions can be implemented in a way to support
this by including a first step to steal all objects needed by the transaction.
I believe this would give you strict serializability at a global level, but
does have trade-offs in needing to know all objects at the start of a
transaction, and other implications on latency, etc. from this step.
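A rough sketch of that "steal everything first" step, in Python, might look like the following. It is purely hypothetical: the function name `run_transaction`, the dictionary layout, and the zone names are invented here, ownership changes are a plain assignment standing in for a real steal round, and atomic commit and failure handling are not modeled at all.

```python
def run_transaction(owners, logs, zone, writes):
    """Apply `writes` (key -> value) from `zone` after taking ownership of every key.

    owners: dict mapping key -> zone that currently leads it
    logs:   dict mapping key -> list of values (one log per object)
    Returns the list of keys that had to be stolen, a stand-in for the
    extra wide-area rounds the transaction would pay for.
    """
    keys = sorted(writes)  # the full key set must be known before starting
    stolen = [k for k in keys if owners.get(k) != zone]
    for k in stolen:
        owners[k] = zone   # placeholder for a real steal round
    for k in keys:
        logs.setdefault(k, []).append(writes[k])
    return stolen


owners = {"a": "us-east", "b": "eu-west", "c": "us-east"}
logs = {}
stolen = run_transaction(owners, logs, "us-east", {"a": 1, "b": 2})
print(stolen, owners["b"], logs)
```

Even in this toy form, the two costs mentioned above are visible: the key set is an input to the function, and every key led by another zone adds a steal before any write can proceed.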

The authors also note they haven't implemented transactions yet.

Also worth noting that some of the extensions/optimizations, such as the
mentioned 'locality adaptive object stealing optimization' won't work as
described with multi-object transactions. I could see additional work to
identify groups of objects, but this would be very workload specific and not
suitable as a generic solution.

~~~
matthelb
I'm not thinking of strict serializability - I do actually mean
linearizability.

My previous comments were referring to the basic algorithm presented as the
main contribution of this paper in section 3. In this algorithm, "Every
command accesses only one object o." With each operation applying to a single
object and each object satisfying linearizability, the system as a whole will
be linearizable.

The distinction between linearizability and strict serializability is somewhat
subtle. I highly recommend this blog post [1] by Irene Zhang and this blog
post [2] by Peter Bailis for some really great discussion on the subtleties
involved.

[1]
[https://irenezhang.net/blog/2015/02/01/consistency.html](https://irenezhang.net/blog/2015/02/01/consistency.html)

[2]
[http://www.bailis.org/blog/linearizability-versus-serializability/](http://www.bailis.org/blog/linearizability-versus-serializability/)

~~~
ccleve
Perhaps we should step away from the terms "linearizability" and
"serializability" and speak of global total ordering. That's what some other
systems provide and this one doesn't. It's a valuable feature because it makes
it possible to know the state of the system as of a snapshot in time.

~~~
matthelb
This is exactly what I was addressing in your initial comment! This system
provides a global total ordering of operations. It's the whole point of a
replicated state machine.

The ensuing conversation explored how/why the system provides a global total
ordering.

------
justinsb
I think it's a huge positive development that there's a reference
implementation as well, in Go. At least helpful for evaluation, possibly for a
production implementation also.

------
iambvk
What are example real-world applications and/or use cases where the number of
acceptors must be more than 3 or 5? Thanks.

