
Distributed consensus reloaded: ZooKeeper and replication in Kafka - nehanarkhede
http://www.confluent.io/blog/distributed-consensus-reloaded-apache-zookeeper-and-replication-in-kafka
======
ahachete
Very insightful post. My main concern with this approach is that, since the
ISR has a fixed membership at any given time (as the post explains), writes
are subject to latency spikes whenever a node in the ISR is slow. That seems
likely to happen a lot in real deployments, and it's a major drawback.

So while providing a lot of flexibility and decreasing the communication
overhead when using a lot of nodes, it may introduce performance degradation
and latency spikes when nodes are slow. I'd probably prefer a design where
some kind of reduced majority is required rather than the full list of nodes,
so that outliers are filtered out.

Are there good percentile numbers out there for metadata writes at, say, the
99.99%-99.999% percentiles, to check how bad this may get?

~~~
fpj
I'm happy you enjoyed the post and thanks for the comment.

If I understand your comment correctly, it isn't the case that the ISR is
fixed. The ISR has a minimum size, but it can change over time, so brokers can
be removed from the ISR and they can rejoin later once they catch up.

~~~
ahachete
Yes, I got that; maybe I wasn't clear enough. My main concern is that slow
nodes (rather than failed nodes) may easily provoke latency spikes, and that
seems to me a quite frequent situation. The advantage of quorum writes is
that outliers are ignored, but with the ISR, since you must wait for all of
its members, outliers are not ignored (until, perhaps, they are removed from
the ISR). I understand the advantages and the trade-off here. But I would
like to see whether this is a good trade-off, as outliers may have a big
impact.
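The tail-latency difference between the two schemes can be illustrated with a toy simulation (hypothetical latency numbers, not measurements of Kafka; the stall probability and replica count are assumptions chosen to make the effect visible):

```python
import random

def commit_latency_isr(latencies):
    """ISR-style: the leader waits for every in-sync replica to ack."""
    return max(latencies)

def commit_latency_quorum(latencies, quorum):
    """Quorum-style: the leader waits only for the fastest `quorum` acks."""
    return sorted(latencies)[quorum - 1]

random.seed(0)
n = 5          # replicas
trials = []
for _ in range(10_000):
    # Most replicas ack in ~1 ms, but each has a 1% chance of a 100 ms stall.
    lat = [100.0 if random.random() < 0.01 else random.uniform(0.5, 1.5)
           for _ in range(n)]
    trials.append((commit_latency_isr(lat), commit_latency_quorum(lat, 3)))

isr_p99 = sorted(t[0] for t in trials)[int(0.99 * len(trials))]
quorum_p99 = sorted(t[1] for t in trials)[int(0.99 * len(trials))]
print(f"p99 commit latency: ISR={isr_p99:.1f}ms  majority quorum={quorum_p99:.1f}ms")
```

With a 1% per-replica stall chance and five replicas, roughly 5% of writes hit a stalled node, so the wait-for-all p99 absorbs the full stall while the 3-of-5 quorum p99 stays near the common case.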

~~~
fpj
Got it. Yeah, quorum systems have higher tolerance to tail latency, there is
no question about it. We mention it briefly in the post, but we don't have
numbers to show. I'm not aware of it being a major concern for Kafka
deployments, but I can say that for Apache BookKeeper, we ended up adding the
notion of ack quorums to get around exactly such latency spikes. I'll see if
I can gather some more information about Kafka that we can share at a later
time. Thanks for raising this point.
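The ack-quorum idea mentioned above can be sketched as follows (a minimal illustration of the concept, not BookKeeper's actual API; the latency values are made up):

```python
import heapq

def acked_latency(replica_latencies_ms, ack_quorum):
    """An entry is considered durable once `ack_quorum` of the written
    replicas have acked; slower replicas catch up asynchronously, so one
    straggler does not stall the writer."""
    return heapq.nsmallest(ack_quorum, replica_latencies_ms)[-1]

# Write quorum of 3, ack quorum of 2: one slow replica (e.g. a GC pause)
# does not delay the acknowledgement to the client.
latencies_ms = [1.2, 0.9, 250.0]
print(acked_latency(latencies_ms, ack_quorum=2))  # 1.2
```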

~~~
ahachete
It would be awesome to have some numbers on this topic. Thanks for your
interest. I guess with other systems like Paxos this could be solved by
separating the roles of learners and acceptors, which are usually collapsed
onto the same nodes. In that case, you may have more learners than acceptors
(solving the N^2 growth of communication with the number of nodes) while
still addressing tail latency by running the quorum among the acceptors only.

------
nano_o
It seems similar to the primary-backup instance of the Vertical Paxos family.
In the primary-backup Vertical Paxos, one can tolerate f faults with f+1
replicas as long as a reliable external reconfiguration master is there to
replace failed replicas and make sure everyone agrees on the configuration.
Here the external reconfiguration master would be ZooKeeper and the primary-
backup protocol the ISR protocol.

[http://research.microsoft.com/en-
us/um/people/lamport/pubs/v...](http://research.microsoft.com/en-
us/um/people/lamport/pubs/vertical-paxos.pdf)

~~~
mjb
Yeah, there are lots of similarities to Vertical Paxos. It also shares a lot
with van Renesse and Schneider's classic Chain Replication paper
([https://www.usenix.org/legacy/event/osdi04/tech/full_papers/...](https://www.usenix.org/legacy/event/osdi04/tech/full_papers/renesse/renesse.pdf)),
with the use of a configuration master to configure a chain that can't solve
uniform consensus by itself.

That's not a criticism at all, though. There are a lot of good design and
systems ideas there, and they are well explained in this post.

~~~
fpj
I'm glad you enjoyed the post. The chain replication work indeed relies on a
Paxos-based master service for reconfiguration, but I believe that's the only
similarity; the replication scheme is otherwise different.

------
dwenzek
I find this post a bit disappointing.

The subject is exciting, and I wished to learn more about how to design an
efficient and safe replication scheme on top of two coordination protocols,
each with its own core set of guarantees and constraints.

The post gives a good overview of the trade-off between safety and performance
made with _in-sync replicas_ (ISR) compared to quorum acknowledgement.

But it remains very vague on how to deal with the problem stated in the
"ZooKeeper and consensus" section: _being consistent does not mean that the
values read [by two workers] are necessarily the same_; they are only
computed from consistently growing prefixes of the sequence of updates. I
totally fail to understand the way proposed _to break the tie_.

I would expect the diagrams to better show lagging replicas and even stale
views of the ISR. I would expect more evidence of how the system ensures that
a message delivered to a consumer is never retracted.

~~~
fpj
Thanks for your comments, and I'm sorry that the post does not meet your
expectations. If you write to me directly, I'll be more than happy to clarify
any question you may have. I'm not sure, for example, what is confusing you
about the "ZooKeeper and consensus" section. I'm also not sure what kind of
evidence you're after on published messages being lost.

~~~
dwenzek
Thanks for offering your help!

What I find vague in "ZooKeeper and consensus" is the answer to "Why does this
proposal work compared to the original one?":

>> Because each of the workers has “proposed” a single value

Do the processes read or propose the value?

>> and no changes to those values are supposed to occur.

Not supposed to? How do you ensure that?

>> Over time, as the configurator changes the value, they can agree on the
>> different values by running independent instances of this simple agreement
>> recipe.

Sorry, but I fail to see what recipe you speak about.

~~~
fpj
>> Do the processes read or propose the value?

They do both: each one first proposes by writing a sequential znode and then
reads the children (all proposals are written under some parent znode). This
certainly assumes some experience with ZK, and I wonder if that's the
problem. It was not the goal to go into a discussion of the ZooKeeper API,
but I'm happy to clarify if this is what is preventing you from getting the
point.

>> Not supposed to? How do you ensure that?

It is not supposed to change in the sense that, if this is implemented right,
the proposed value of each client won't change. The recipe guarantees it
because it assumes that each client writes just once.

>> Sorry, but I fail to see what recipe you speak about.

I'm referring to these three steps: creating a sequential znode under a known
parent, reading the children, and picking the value in the znode with the
smallest sequence number.
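The three steps can be simulated end to end against a small in-memory stand-in for the ZooKeeper calls involved (the `MockZooKeeper` class and the znode naming are illustrative stand-ins, not the real ZooKeeper client API):

```python
class MockZooKeeper:
    """In-memory stand-in for the few ZooKeeper operations the recipe uses."""
    def __init__(self):
        self.znodes = {}    # path -> value
        self.counter = 0

    def create_sequential(self, parent, value):
        # ZooKeeper appends a monotonically increasing 10-digit suffix
        # to sequential znodes.
        path = f"{parent}/prop-{self.counter:010d}"
        self.counter += 1
        self.znodes[path] = value
        return path

    def get_children(self, parent):
        prefix = parent + "/"
        return sorted(p[len(prefix):] for p in self.znodes if p.startswith(prefix))

    def get(self, path):
        return self.znodes[path]

def propose_and_decide(zk, parent, my_value):
    """The three-step recipe: propose by writing a sequential znode,
    read the children, pick the value with the smallest sequence number."""
    zk.create_sequential(parent, my_value)
    children = zk.get_children(parent)
    winner = min(children)    # smallest sequence number wins the tie-break
    return zk.get(f"{parent}/{winner}")

zk = MockZooKeeper()
# Two workers propose different values they have read; both decide the same one.
print(propose_and_decide(zk, "/config/agree-0", "value-A"))  # value-A
print(propose_and_decide(zk, "/config/agree-0", "value-B"))  # value-A
```

Because every worker writes exactly once and the smallest sequence number is stable once created, every worker that runs the recipe against the same parent znode decides the same value.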

~~~
dwenzek
Thanks for these explanations! Things are much clearer now.

Indeed, I have no experience programming ZooKeeper, even though I use it a
lot behind tools like Kafka or Storm.

------
zenbowman
Well written and understandable. The use of illustrations was particularly
good.

~~~
nehanarkhede
Thank you!

