
KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum - telekid
https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum
======
mayank
Note: this is not about replacing ZooKeeper in general with Kafka, as the title
might suggest; it's about Kafka considering an alternative to its internal use
of ZooKeeper:

"Currently, Kafka uses ZooKeeper to store its metadata about partitions and
brokers, and to elect a broker to be the Kafka Controller. We would like to
remove this dependency on ZooKeeper. This will enable us to manage metadata in
a more scalable and robust way, enabling support for more partitions. It will
also simplify the deployment and configuration of Kafka"

~~~
sonabinu
The current heading is ambiguous as well.

~~~
jimktrains2
Honestly, the new one is even less informative than the original.

~~~
tus88
Well if the original gave the impression that kafka and zk were the same kind
of software it must have been pretty bad.

------
kevsim
In the days before hosted Kafka implementations were readily available, I
tried to set up Kafka in our AWS infrastructure. Getting Zookeeper working in
a world based on autoscaling groups was a nightmare. It felt like it was built
for the days where each server was a special snowflake.

Looking forward to seeing if this gains traction.

~~~
rad_gruchalski
This is just an FYI type of comment. With zookeeper, one needs an odd number
of machines, usually not greater than seven. Servers are indexed from 1 to n.
They’re replaced one for one, so the server id matches. Scaling it up and down
is a bit of a pita because every server needs to know all other servers.
Changing the ensemble members requires restarting every zookeeper instance
because zookeeper doesn’t support config reload.

Zookeeper does expect a single “host” for every server but it runs perfectly
fine with docker. But that’s no different from consul or etcd.
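
To make the "every server needs to know all other servers" point concrete, a static ensemble definition looks roughly like this (hostnames here are made up for illustration); each member carries the full server list, which is why membership changes traditionally meant touching every node:

```
# zoo.cfg sketch -- hostnames are hypothetical
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# every server lists all ensemble members; each box also needs a
# dataDir/myid file whose number matches its server.N entry
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```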

~~~
thomaslee
Another FYI type comment I guess. :) Some of my more gripey statements here
may be outdated info, so DYOR I guess.

Dynamic reconfig in 3.5 addresses the "restarting every zookeeper instance"
problem. [0] You stand up an initial quorum with seed config, then tie in new
servers with "reconfig -add". Not sure how well it would tie into cloudy
autoscaling stuff though. I wouldn't start there myself.
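
For the curious, the 3.5 CLI flow looks roughly like this (addresses and ports are made up; the reconfig docs at [0] have the authoritative syntax):

```
# zkCli.sh session sketch
reconfig -add server.4=zk4.example.com:2888:3888;2181
reconfig -remove 4
config   # prints the current ensemble membership
```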

A much bigger pain IMO is the handling of DNS in the official Java ZK client
earlier than 3.4.13/3.5.5 (and by association, Curator, ZkClient, etc.). [1]
The former was released mid 2018 and the latter this year, so tons of stuff
out there that just won't find a host if IPs change. If you "own" all the
clients it's maybe not a problem, but if you've got a lot of services owned by
a ton of teams it's ... challenging.

Even with the fix for ZOOKEEPER-2184 in place I'm pretty sure DNS lookups are
only retried if a connect fails, so there's still the issue of IPs "swapping"
unexpectedly at the wrong time in cloud environments which can lead to a ZK
server in cluster A talking to a ZK server in cluster B (or worse: clients of
cluster A talking to cluster B mistakenly thinking that they're talking to
cluster A). I'm sure this problem's not unique to ZK though.

Authentication helps prevent the worst-case scenarios, but I'm not sure if it
helps from an uptime perspective.

TL;DR: ZK in the cloud can get messy (even if you play it relatively "safe").

[0]
[https://zookeeper.apache.org/doc/r3.5.5/zookeeperReconfig.ht...](https://zookeeper.apache.org/doc/r3.5.5/zookeeperReconfig.html)
[1]
[https://issues.apache.org/jira/browse/ZOOKEEPER-2184](https://issues.apache.org/jira/browse/ZOOKEEPER-2184)
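
The failure mode in [1] boils down to a client caching the result of a one-time DNS lookup. A minimal sketch of the defensive pattern, re-resolving on every connection attempt (illustrative Python, not ZK's actual client code; the names are mine):

```python
import socket

def resolve_endpoints(host, port):
    """Re-resolve the hostname on every call instead of caching the
    first answer, so a changed A record is picked up on reconnect."""
    return [info[4][0]
            for info in socket.getaddrinfo(host, port,
                                           proto=socket.IPPROTO_TCP)]

def connect_with_fresh_dns(host, port, timeout=5.0):
    # Try each currently-advertised address; a stale cached IP would
    # silently point at whatever machine inherited that address.
    last_err = None
    for ip in resolve_endpoints(host, port):
        try:
            return socket.create_connection((ip, port), timeout=timeout)
        except OSError as err:
            last_err = err
    raise last_err or OSError(f"no addresses found for {host}")
```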

~~~
rad_gruchalski
Good to know! Thank you! Haven’t dealt with zk in detail since ~2016.

------
mykowebhn
Isn't managing consensus extremely hard to do? Wouldn't one want to rely on a
proven solution rather than spinning up a new solution?

~~~
Plugawy
The document mentions using Raft for consensus and coordination - it's the
same approach used by Etcd, Consul, Serf, RethinkDB and other systems. AFAIK
it's easier to implement and understand than Zookeeper's consensus protocol
(Zab).

[https://raft.github.io](https://raft.github.io)
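
As a taste of why people find Raft approachable: leader election needs surprisingly little state. A toy sketch of the RequestVote rule from the Raft paper (illustrative only; not code from any of the systems above):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    current_term: int = 0
    voted_for: Optional[str] = None  # candidate we granted a vote this term

def handle_request_vote(node, candidate_id, candidate_term,
                        candidate_last_log, our_last_log):
    """Grant a vote iff the candidate's term is current, we haven't
    already voted for someone else this term, and the candidate's log
    is at least as up-to-date as ours ((last_term, last_index) tuples)."""
    if candidate_term < node.current_term:
        return False  # stale candidate, reject
    if candidate_term > node.current_term:
        # Newer term: adopt it and forget any vote from the old term.
        node.current_term = candidate_term
        node.voted_for = None
    log_ok = candidate_last_log >= our_last_log  # lexicographic tuple compare
    if log_ok and node.voted_for in (None, candidate_id):
        node.voted_for = candidate_id
        return True
    return False
```

At most one vote per term plus the log up-to-dateness check is what guarantees a single leader per term.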

~~~
ypcx
Kafka was thinking about etcd:

[https://cwiki.apache.org/confluence/display/KAFKA/KIP-273+-+...](https://cwiki.apache.org/confluence/display/KAFKA/KIP-273+-+Kafka+to+support+using+ETCD+beside+Zookeeper)

Considering most Kafkas will probably run in Kubernetes at some point, they
could have shared the etcd used by Kubernetes.

~~~
Plugawy
Sounds like a bad idea, no? The last thing you want is the whole cluster
going down because Kafka is misbehaving.

~~~
ypcx
I don't know. First, there may be some separation achievable - but I'm just
guessing. Second, if Kafka is your primary workload, you don't want it to
misbehave in any case. But of course I get your point. I'm just saying I've
seen people thinking about it that way in some Github Issues.

------
noahdesu
This sounds really similar to some of the work coming out of Vectorized on
Redpanda [0]. They're building a Kafka API compatible system in C++ that's
apparently achieving significant performance gains (throughput, tail latency)
while maintaining operational simplicity.

[0]: [https://vectorized.io/redpanda](https://vectorized.io/redpanda)

~~~
shaklee3
Where do you see performance comparisons to Kafka? As far as I can tell, the
product you linked doesn't exist outside of a blog, so this reads like a
marketing post.

------
linuxhansl
I don't quite understand why everybody and their mother is trying to remove
Zookeeper from their setup.

In my past I've seen this many times and each time people went back to
Zookeeper after a while, because - as it turns out - consensus is hard; and
Zookeeper is battle hardened.

~~~
atombender
First, because it's yet another dependency. Consensus-based systems like
CockroachDB, Dgraph, Cassandra, Riak, Elasticsearch, ActorDB, Rqlite,
Aerospike, YugaByte, etc. are wonderfully easy to deploy because they can run
with no external dependencies. (Their consensus protocols may have different
problems, but that's beside the point.)

Secondly, it's a pretty _heavy_ dependency. The JVM is RAM-hungry and it's
difficult to ensure that it always has enough RAM. Running multiple JVM apps
on a single node must be done carefully to make sure each app has enough
headroom. It consumes considerably more RAM than Etcd and Consul.

Thirdly, I think it's fair to say that ZK is showing its age. It's notorious
for being hard to manage (see the other comments in this thread), with a
fairly old design (based on the now-ancient Google Chubby paper) that, while
resilient, is also less flexible than some other competing designs.

------
simtel20
Is no-one else running into FastLeaderElectionFailed? When you have a system
that writes a lot of offset/transaction info to zookeeper you can push the
zxid 32-bit counter to rollover in a matter of days. When this happens it can
bring zookeeper to a grinding halt for 15 minutes after 2 nodes try to
nominate themselves for leadership and the rest of the cluster sits back and
waits for a timeout.

[https://issues.apache.org/jira/browse/ZOOKEEPER-2164](https://issues.apache.org/jira/browse/ZOOKEEPER-2164)

[https://issues.apache.org/jira/browse/ZOOKEEPER-2791](https://issues.apache.org/jira/browse/ZOOKEEPER-2791)
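
For a rough sense of the numbers: a zxid packs the leader epoch into the high 32 bits and a per-epoch transaction counter into the low 32, so the counter wraps after 2^32 transactions in a single epoch. A quick sketch:

```python
EPOCH_BITS = 32
COUNTER_MASK = (1 << 32) - 1

def zxid_parts(zxid):
    """Split a 64-bit zxid into (epoch, counter)."""
    return zxid >> EPOCH_BITS, zxid & COUNTER_MASK

def days_until_rollover(writes_per_second):
    """How long a fresh epoch's 32-bit counter lasts at a steady write rate."""
    return (1 << 32) / writes_per_second / 86400

# At 10,000 ZK transactions/sec the counter wraps in roughly 5 days.
```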

Past requests (I can't find them in JIRA at the moment, so I'm paraphrasing)
for a call that initiates a controlled leadership move to another node have
been turned down as "you don't need this", yet leadership election fails in
some circumstances! In addition, there's no command or configuration option to
disable FastLeaderElection.

So the zookeeper maintainers keep operators limited to having to flip nodes
off and on again, which is really a bad way to manage software because it
impacts clients as well as leadership (and even if clients recover, most code
that I've seen like to make some noise when zk connections flap). I would
really like to eliminate all use cases for zookeeper where there is a chance
that the zxid will exceed the size of its 32-bit counter component in the span
of, say, a decade, so that as an operator I don't have to set alerts on the
zxid counter creeping up, or reset zookeeper and restart all of its
clients (many versions of many zookeeper clients don't retry after connection
loss, don't retry after a timeout, don't cope with the primary connection
failing, will have totally given up after 15 minutes, etc.).

I think that the kafka maintainers have been doing a better job of actively
maintaining their code and ensuring it works in adverse conditions, so I'm on
board with this proposal.

Zookeeper isn't magic, it's just pretty good at most of what it does, and I
think that projects that understand when they've pushed zookeeper into a bad
corner may benefit from this kind of move, if they also have a good idea of
how they can do better.

------
james-mcelwain
There's a toy Kafka implementation written in Go that attempts to do this:
[https://github.com/travisjeffery/jocko](https://github.com/travisjeffery/jocko)

Previous HN discussion:
[https://news.ycombinator.com/item?id=13449728](https://news.ycombinator.com/item?id=13449728)

~~~
rad_gruchalski
The author now works for Confluent.

------
MrBuddyCasino
To all the people wondering why "replace battle-tested ZK" and how "consensus
is hard": it's right there under the motivation header:

> enabling support for more partitions

I don't know if any of you have ever run a high-throughput Kafka cluster with
a large number of partitions (as in, thousands of them), but it's not pretty.
Rebalancing can easily take half an hour after a rollout, and throughput is
degraded during that time. We recently had to move to shared topics because it
became untenable.

This is a very welcome change!

------
camiloaguilar
I’m not too convinced of the approach. I’ve been anxiously waiting for
[https://vectorized.io/](https://vectorized.io/) to release their message
queue. It is built in modern C++ and uses ScyllaDB's Seastar framework to do
IO scheduling in userspace, with better mechanical sympathy. And like Hashicorp's
Nomad and Vault, which I’m a fan of, it has built-in distributed consensus and
easy operation.

------
rhacker
It would be nice if all the cloud vendors agreed on a key/value and/or
consensus protocol that all servers in a cluster can connect to - and maybe
even supported via docker, natively even if there's just one cluster member.
Like plug-n-play for clustering tech. (Bonjour basically but suitable for
cloud/enterprise software)

------
ccleve
I have written a Raft implementation in Java. If anyone from the Kafka project
wants it, please contact me. It's not open source, but I own it and could make
it so.

~~~
lars_francke
Not to take away from what you did, but the Apache Foundation actually has
Ratis, which is a Raft implementation:
[http://ratis.incubator.apache.org/](http://ratis.incubator.apache.org/)

------
PunksATawnyFill
Whoever wrote this doesn't know WTF a quorum is.

~~~
thilog
Care to elaborate?

------
superapc
we (alluxio.io) went through a similar process in Alluxio 2.0, replacing
Zookeeper with CopyCat (a Raft implementation) for both leader election and
storing the shared journal. Works pretty well.

