
Gryadka is not Paxos, so it's probably wrong - arjunnarayan
http://tschottdorf.github.io/if-its-not-paxos-its-probably-wrong-gryadka
======
neuronsguy
I'm including the Gryadka author's rebuttal here, for completeness:

Thank you for the analysis of my post, but it seems you didn't read it
correctly. Even to read a value you have to execute the full (prepare,
accept) cycle of consensus (in the case of a stable leader we can skip
prepare), so when you read the nil value the state of the system will be:

A: (value=foo ballot=1 promised=2)
B: (value=nil ballot=2 promised=2)
C: (value=nil ballot=2 promised=2)

Not the one you mentioned in the post:

A: (value=foo ballot=1 promised=2)
B: (value=nil ballot=0 promised=2)
C: (value=nil ballot=0 promised=2)

So the counter example is incorrect.

I proved the algorithm mathematically by hand and used very aggressive
property-based testing with fault injection, so I'm pretty confident in its
correctness.
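For readers who haven't seen this pattern: in a CASPaxos-style register even a read runs the full prepare/accept round, which is what moves B and C to ballot=2. A minimal Python sketch (hypothetical names, not Gryadka's actual code):

```python
# Hypothetical sketch of a CASPaxos-style read: a read is a full
# prepare/accept round, so it rewrites ballots on a quorum of acceptors.

class Acceptor:
    def __init__(self):
        self.value, self.ballot, self.promised = None, 0, 0

    def prepare(self, ballot):
        if ballot <= self.promised:
            return None                      # reject stale ballot
        self.promised = ballot
        return (self.ballot, self.value)     # last accepted (ballot, value)

    def accept(self, ballot, value):
        if ballot < self.promised:
            return False
        self.promised = self.ballot = ballot
        self.value = value
        return True

def read(acceptors, ballot):
    """A read is a full round: prepare, then re-accept the value seen."""
    replies = [a.prepare(ballot) for a in acceptors]
    quorum = [r for r in replies if r is not None]
    assert len(quorum) > len(acceptors) // 2, "no quorum"
    _, value = max(quorum, key=lambda r: r[0])  # highest-ballot value wins
    for a in acceptors:
        a.accept(ballot, value)              # write it back at the new ballot
    return value
```

With A holding (foo, ballot=1) but unreachable, `read([B, C], 2)` returns nil and leaves B and C at ballot=2, promised=2, which is exactly the state described above.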

~~~
tschottdorf
I'm not on good enough wifi to reply extensively, but I pushed an
update to the post explaining this better. If you carry out the read
explicitly, you can still get the anomaly in much the same way. I should have
done so from the beginning, but I tried to simplify the argument and went too
far.

------
mjb
This is a very good analysis.

I think it makes a slightly stronger argument about MultiPaxos than it needs
to. There are other correct ways to use single-decree Paxos, as long as you
recognize its limitations. The true limitation is "use every register once",
and a log is the gold standard way to do that. Another correct pattern could
be a set-only K/V store, or a store for monotonic finite state machine states
(at the cost of O(N) rounds for every contender). One real-world example is
2PC coordination, where the state machine is very simple and can be modeled
as an ordered set of three registers.
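To make the "use every register once" discipline concrete, here's a hypothetical sketch (not any particular system) of a set-only K/V store, where each key is backed by its own write-once register so single-decree Paxos suffices per key:

```python
# Hypothetical sketch: a set-only K/V store over single-decree registers.
# Each key maps to a register that is decided at most once; once a value
# is chosen it never changes, so one Paxos instance per key is enough.

class WriteOnceRegister:
    def __init__(self):
        self._decided = None

    def decide(self, value):
        """Stand-in for one single-decree Paxos instance."""
        if self._decided is None:
            self._decided = value            # first writer wins
        return self._decided                 # later writers learn the winner

class SetOnlyKV:
    def __init__(self):
        self._regs = {}

    def set(self, key, value):
        reg = self._regs.setdefault(key, WriteOnceRegister())
        return reg.decide(value)

    def get(self, key):
        reg = self._regs.get(key)
        return reg._decided if reg else None
```

Because no register is ever reused, there's no log and no reconfiguration dance, but you also can't overwrite or delete, which is exactly the trade-off.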

~~~
mjb
While we're here, it's also not true that all correct consensus protocols are
Paxos. For example, Viewstamped Replication is correct, but a different
algorithm (see
[https://arxiv.org/pdf/1309.5671v3.pdf](https://arxiv.org/pdf/1309.5671v3.pdf)).
There are a number of correct algorithms, and they all smell an awful lot like
Paxos, but they aren't all exactly Paxos. That's the hard part: nearly all
Paxos-like algorithms are broken, but not all.

Also from the "variants of Paxos" department is Flexible Paxos
([https://blog.acolyer.org/2016/09/27/flexible-paxos-quorum-intersection-revisited/](https://blog.acolyer.org/2016/09/27/flexible-paxos-quorum-intersection-revisited/)),
which changes the rules about how to select a quorum.
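The relaxed rule is small enough to state in code: phase-1 and phase-2 quorums only need to intersect each other, not both be majorities. A sketch (names are mine):

```python
# Flexible Paxos quorum condition: every phase-1 (leader election)
# quorum must intersect every phase-2 (replication) quorum, which for
# simple counting quorums means |Q1| + |Q2| > N. Classic Paxos is the
# special case where both are majorities.

def valid_fpaxos_quorums(n, q1, q2):
    return 0 < q1 <= n and 0 < q2 <= n and q1 + q2 > n

# Classic Paxos on 5 nodes: two majority quorums of 3.
assert valid_fpaxos_quorums(5, 3, 3)
# Flexible Paxos: a stable leader can commit with just 2 acceptors,
# as long as electing a new leader contacts 4.
assert valid_fpaxos_quorums(5, 4, 2)
# Two quorums of 2 out of 5 may not intersect, so this is unsafe.
assert not valid_fpaxos_quorums(5, 2, 2)
```

The practical appeal is making the common (phase-2) path cheaper at the cost of a more expensive leader election.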

~~~
Scaevolus
EPaxos (essentially leaderless Paxos) is interesting as well, but I'm not sure
how much deployment it's seen:
[https://github.com/efficient/epaxos](https://github.com/efficient/epaxos)

The repo includes a TLA+ model too, which will hopefully become a trend in
newly proposed distributed systems in the future!

Cassandra looked into implementing it, and reached a prototype implementation
with ~60% better performance, but it looks like the contributor didn't
continue driving it:
[https://issues.apache.org/jira/browse/CASSANDRA-6246](https://issues.apache.org/jira/browse/CASSANDRA-6246)

~~~
mring33621
Five bucks says that 'the contributor' got a job at ScyllaDb.

~~~
sulam
Apple, actually.

------
RcouF1uZ4gsC
[http://www.allreadable.com/5b354QWp](http://www.allreadable.com/5b354QWp)

"every consensus protocol out there or every fully distributed consensus
protocol is either Paxos or Paxos with cruft or broken" - Mike Burrows

~~~
irfansharif
that was a great talk, thank you for sharing!

------
grogers
I haven't done any in-depth analysis of Gryadka, but I think the premise of the
argument here may be wrong. Even if a particular value isn't accepted by a
majority of nodes, that doesn't necessarily make it a dirty read. As long as
everyone participating in the algorithm sees the history as if the value was
committed right before the cas result is committed, it could still be
linearizable. I would need to create a formal model to be sure whether it is
correct or not, but don't just assume it's wrong because it isn't Paxos (which
it isn't, and it shouldn't advertise itself as if it were).

~~~
mattb314
But depending on node failures, couldn't the same client successfully run
cas(nil, A) -> cas(A, B) -> cas(nil, C), with all operations succeeding? Say
the first two operations only succeed in writing to a single node (as in the
post's example). Then if that single node goes down, the third cas will
succeed, which is certainly not linearizable. Note: I'm working off the
assumption that the system is correctly described by the post's author. The
original author of Gryadka has disputed the description in the post, and I
haven't read the source.
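That intuition can be checked mechanically. Assuming the post's description is accurate (which, as noted, the Gryadka author disputes), a brute-force check over the three successful cas results finds no sequential order in which all of them could legally succeed, so no linearization exists:

```python
# Replay the three successful CAS operations against a single
# sequential register. If no ordering of the completed operations is
# consistent with a sequential register, the history cannot be
# linearizable (linearizability additionally constrains real-time
# order, so this check is generous to the history).

from itertools import permutations

history = [("cas", None, "A"), ("cas", "A", "B"), ("cas", None, "C")]

def sequentially_consistent(ops):
    reg = None
    for _, expect, new in ops:
        if reg != expect:
            return False          # this CAS should have failed
        reg = new
    return True

linearizable = any(sequentially_consistent(p) for p in permutations(history))
print(linearizable)   # False: no ordering lets all three CAS succeed
```

In every ordering, either cas(A, B) runs before the register holds A, or one of the cas(nil, ...) operations runs after the register is non-nil.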

------
norswap
It would have been nicer to contact the authors of the algorithm before making
these (apparently incorrect) claims via blog post. It's just sensationalistic.

------
gtrubetskoy
For those wondering, "gryadka" is "bed" as in vegetable/garden/flower bed in
Russian.

~~~
inlineint
And Redis, multiple instances of which are supposed to be plugged into
gryadka, means "radish" in Russian :)

~~~
fwefwwfe
Sounds Mario 2 themed.

------
sargun
I think the most interesting thing the article states is the following
question (and proposed answer):

Is it possible to get compare-and-swap without the log?

TL;DR: I don’t think so.

Riak Ensemble is one such system, which utilizes single-decree Paxos in order
to try to achieve CAS:
[https://github.com/basho/riak_ensemble](https://github.com/basho/riak_ensemble)

If only the distinguished leader is allowed to publish a proposal for a
value change, then this allows you to preserve the CAS invariant.
Unfortunately, this comes at the sacrifice of liveness. If you then have a
second Paxos group to elect the leader, this removes the liveness issue.
On view change, you have to contact a quorum of nodes and re-propose at a new
epoch in order to avoid issues.
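A hypothetical sketch of that leader-gated pattern (not riak_ensemble's actual code): acceptors remember the highest epoch handed out by the election group and reject proposals from stale epochs, which is what preserves the CAS invariant:

```python
# Hypothetical sketch of leader-gated CAS with epochs. A separate
# election (itself a consensus instance) hands out increasing epochs;
# acceptors reject proposals carrying a stale epoch, so at most one
# proposer (the current leader) can drive CAS at a time.

class Acceptor:
    def __init__(self):
        self.epoch = 0
        self.value = None

    def propose(self, epoch, expect, new):
        if epoch < self.epoch:
            return False          # stale leader: reject
        self.epoch = epoch
        if self.value != expect:
            return False          # CAS precondition failed
        self.value = new
        return True

def leader_cas(acceptors, epoch, expect, new):
    """The leader's CAS commits iff a majority of acceptors accept."""
    acks = sum(a.propose(epoch, expect, new) for a in acceptors)
    return acks > len(acceptors) // 2
```

A real implementation also needs the view-change step from the comment above: before serving any CAS, a new leader must read a quorum and re-propose what it finds at its own epoch.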

I'm curious if someone can poke a hole in riak_ensemble's algorithm.

------
elvinyung
I know that consensus is probably the cleanest way to guarantee very strict
correctness, but I'm kind of not convinced that synchronous replication is the
ideal solution for HA. I have yet to see clear evidence that teams at less-
than-Google scale aren't mostly getting by with semisynchronous replication,
or even just asynchronous replication.

Always running an ensemble of three or five nodes for each shard seems to be
pretty overkill (and expensive, both in terms of money and latency),
especially if it's mostly just a way to do automated failovers. I sometimes
wonder if there's a good-enough cheaper semisync alternative.

~~~
electronvolt
You get other advantages for doing three/five node replication--online updates
are free if you require (n-1) compatibility.

Reality is that if you need to hit 5+ nines and require strong consistency
guarantees (like "we never lose our customer's data") at data center scale,
you'll probably need something similar. Most people probably don't need those
guarantees, however.

~~~
elvinyung
I should have clarified -- I meant mostly for things like OLTP databases.

I see why Google built Megastore, Spanner, etc. And sure, ZooKeeper or etcd
makes sense for putting small amounts of configuration data.

But most of us aren't Google, and synchronously replicating the entire OLTP
database, 3 or 5 nodes times the number of shards, seems kind of absurdly
expensive for most people.

~~~
cube2222
A node can handle many shards, not just one. So you can basically have 10
nodes and 8 shards with 5 replicas each. (Just an example.)
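The arithmetic works out because replication multiplies replicas, not nodes: 8 shards x 5 replicas = 40 replicas over 10 nodes is 4 per node. A toy round-robin placement (hypothetical, ignoring rack/zone awareness):

```python
# Hypothetical round-robin placement: 8 shards, 5 replicas each, over
# 10 nodes. Each shard's replicas land on 5 distinct nodes, and the
# 40 replicas spread evenly at 4 per node.

def place(num_shards, replicas, num_nodes):
    return {s: [(s * replicas + r) % num_nodes for r in range(replicas)]
            for s in range(num_shards)}

p = place(8, 5, 10)
per_node = [sum(n in reps for reps in p.values()) for n in range(10)]
print(per_node)   # [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
```

So the 3- or 5-way replication cost is per byte stored, not per machine deployed.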

~~~
elvinyung
Oops, you are correct -- I somehow forgot that shards are logical, not
physical.

------
nsxwolf
That article was like setting foot on an alien world. I have never even heard
of a single one of those things.

~~~
mindcrime
If you're interested, read these:

[https://en.wikipedia.org/wiki/Consensus_%28computer_science%...](https://en.wikipedia.org/wiki/Consensus_%28computer_science%29)

[https://en.wikipedia.org/wiki/Two_Generals'_Problem](https://en.wikipedia.org/wiki/Two_Generals'_Problem)

[https://en.wikipedia.org/wiki/Byzantine_fault_tolerance](https://en.wikipedia.org/wiki/Byzantine_fault_tolerance)

[https://en.wikipedia.org/wiki/Paxos_%28computer_science%29](https://en.wikipedia.org/wiki/Paxos_%28computer_science%29)

~~~
nsxwolf
So is this like the sort of stuff in newfangled cars, where you hear about
automatic brakes, and they say "There's 4 computers and at least 3 of them
have to agree on the sensor data before they slam the brakes"... kind of
thing?

~~~
peterwwillis
tl;dr1 Yes, read the paper I link at the bottom, or this
[http://users.ece.utexas.edu/~bevans/courses/ee382c/projects/fall99/curtis-france/talk.pdf](http://users.ece.utexas.edu/~bevans/courses/ee382c/projects/fall99/curtis-france/talk.pdf)

tl;dr2 when you read "Paxos" or "distributed"-anything just assume it's a
bunch of nerds arguing about theoretical problems inside a box

Recap of those wiki pages:

- Consensus is many different ways to solve similar problems, all of which
involve agreeing on a solution, to different degrees.

- The Two Generals' Problem is a provably unsolvable problem that says you
can't know what two generals are going to do if you have no way to prove
communication between them is valid (or happens at all).

- Byzantine faults are basically the same thing, but with 10 people instead
of 2.

- Paxos is a bunch of ways of proving various degrees of different kinds of
consensus, with different uses, that works most of the time. (Also known as
"the algorithm family that comp sci majors keep trying to improve on, but
then a genius points out that any changes make it work differently, and
therefore must suck.")

Why do we care? So you can use 5 servers all around the world, store random
messages on them, and be sure that a bunch of them have the information you
want, that it is correct, that it will still be available somewhere if one of
them goes down, and that if one comes back up with shitty data, it will get
corrected and become good data again.

The reason why this is hard is we assume that the messages:

1. have no CRC (integrity: "this data is not corrupt"),

2. or digital signature (integrity+authenticity: "this message is not corrupt
and definitely came from Bob"),

3. and that even if they did, it could have been 3a. "wrong" before it was
signed, or 3b. "faked" so that we can't tell when a CRC or signature is
correct,

4. or the message never arrived,

5. or that random valid messages have been saved on one server, and when it
rejoins the group it now has data that, regardless of it being good or bad,
we don't want in the rest of the pool of servers, because now the rest have
to incorporate it, and what if there are conflicts.

The Paxos algorithm family seeks to solve most of this in one go, with
exceptions.

Do you need Paxos to solve all those problems? No. Do you need it to solve
those problems in the theoretical boundaries of comp sci nerds? Yes. Do you
need to know how it works? No. Are there situations where a Paxos network can
simply stop working? Yes. Do you _need_ to use Paxos for something you're
working on? Probably not. Is it possible Paxos will not solve your problem?
Yes. And do you still need plans to back up and restore your data in the event
of catastrophic failure? Definitely.

The only paper I currently know of that I would recommend anyone read on
byzantine failures and their solutions is this one:
[https://www.cs.indiana.edu/classes/p545/post/lec/fault-tolerance/Driscoll-Hall-Sivencrona-Xumsteg-03.pdf](https://www.cs.indiana.edu/classes/p545/post/lec/fault-tolerance/Driscoll-Hall-Sivencrona-Xumsteg-03.pdf)

(Granted, I don't know what I am talking about, so take all that with a grain
of salt)

------
jolux
What about Raft?

------
vittore
Can we have a TLA+ spec for Gryadka done by someone?

~~~
rystsov
I don't know TLA+ yet but I'll assist anybody with an explanation on how
Gryadka works.

------
whatnotests
Everybody upvoting this without any discussion.

I suppose "It's not Paxos, so it's probably wrong" is beyond discussion (not
that I disagree, of course).

~~~
skybrian
Is it? What about Raft?

------
jancsika
"If it’s not Paxos, it’s probably wrong."

Is Bitcoin Paxos?

~~~
kcudrevelc
It's not a (provable, guaranteed) CP system. Bitcoin attempts to provide a
global consensus across a large number of actors, but its attempt is based on
proof of work, specifically the fact that it's computationally difficult to
find hashes with a certain property (leading zeros, last I checked). It in no
way guarantees consistency in the face of partitions.

Consider the simplest case: 1/2 of bitcoin users/miners are temporarily split
off from the other half for, say, a week (all Atlantic/Pacific fibers are
broken at once, all satellites fall out of the sky, other huge catastrophes
all occur simultaneously). Each half would happily append its own blocks to
the blockchain, and depending on the chain lengths added, once the partition
went away there would be no good way to reconcile. Thus: not consistent in
the face of partitions.

~~~
jancsika
> Each half would append their own blocks to the blockchain happily, and
> depending on the chain lengths added, once the partition went away there
> would be no good way to reconcile.

Why do you say that? The chain with the greatest total difficulty wins.
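For concreteness, a sketch of that heaviest-chain rule (simplified; real Bitcoin sums the work implied by each block's difficulty target):

```python
# Sketch of heaviest-chain fork choice. During a partition each side
# extends its own fork; on heal, nodes adopt the fork with the greater
# total difficulty, and the losing fork's blocks are abandoned.

def total_difficulty(chain):
    return sum(block["difficulty"] for block in chain)

def fork_choice(fork_a, fork_b):
    if total_difficulty(fork_a) >= total_difficulty(fork_b):
        return fork_a
    return fork_b

east = [{"difficulty": 10}] * 6   # 6 blocks mined on one side
west = [{"difficulty": 10}] * 4   # 4 blocks mined on the other
winner = fork_choice(east, west)
assert winner is east             # west's 4 blocks are reorged out
```

The reconciliation is deterministic, but the losing side's confirmed transactions are simply discarded, which is the sense in which this gives eventual agreement rather than CP-style consistency.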

~~~
nordsieck
That is not an acceptable form of reconciliation in a CP system.

~~~
eternalban
Bitcoin is AP not CP.

~~~
Dylan16807
The hard problem is consistency, in terms of tradeoffs and implementation.
Pure AP is easy.

