
Kafka Removing Zookeeper Dependency - manigandham
https://www.confluent.io/blog/removing-zookeeper-dependency-in-kafka/
======
tbrock
Finally! Travis Jeffery did this years ago in Jocko and also solved my other
beef with Kafka at the same time by building it in Golang.

[https://github.com/travisjeffery/jocko](https://github.com/travisjeffery/jocko)

I've always found that things built on the JVM are a PITA to deploy
(especially when using SSL), so the single Golang binary is a welcome
advancement.

It’s all the good things about Kafka (concept, API, and wire protocol) without
all the crap (zookeeper dep, JVM foundation)

~~~
agallego
Agreed. We're working on exactly this -
[https://vectorized.io/redpanda/](https://vectorized.io/redpanda/)

10x faster. API compat. No jvm.

I also know of other private impls of it. Just makes sense.

~~~
snidane
Nice job. I have always wondered why performant big data systems are written
on the resource-hungry JVM.

~~~
sz4kerto
Because the JVM is fast. And it's not resource-hungry - your program might be
resource-hungry. Bad desktop Java programs misled a whole generation of
programmers about the JVM.

Look at stuff like LMAX. Java can be lightning fast.

~~~
chillacy
It's fast if you measure throughput or median latency, but it tends to have
pretty monstrous worst-case latencies due to GC.

~~~
sz4kerto
Depends. Java's GC can be tuned ad infinitum. It's not a simple task and it
requires knowledge, but that's the beauty of it: if you need low-ish
latencies, then you can tune the GC to target that instead of throughput. For
example, we're using relatively large heaps (partly due to inefficiencies in
our code), but we still want to stay under 500 msec/request or so. So we told
the G1 GC to target 150 msecs per collection, and then adjusted our heap size
/ code accordingly. It works well.
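
Concretely, that kind of target is set with G1's pause-time flag, which is a
soft goal rather than a hard guarantee - something along these lines (the
heap sizes here are just placeholders):

    # minimal sketch: G1 with a 150 ms pause-time goal
    java -Xms8g -Xmx8g \
         -XX:+UseG1GC \
         -XX:MaxGCPauseMillis=150 \
         -jar app.jar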

If you need really hard limits on collection then that's a tricky problem, but
that's also tricky when you're managing memory yourself.

~~~
lllr_finger
Once you start talking hundreds to thousands of requests per second, 500ms is
an incredibly long time and you're well past simple tweaks to GC. Tuning GC to
a high degree is non-deterministic black magic, which is not what you're
looking for at that point.

Simple tweaks can go a long way for a lot of developers, but GC performance
has been a problem at the last 3 organizations I've been at - and I'm not in
the valley or at a FAANG - so it isn't exactly an uncommon scenario for
developers.

~~~
apta
The JVM now ships with ZGC, a low-latency GC targeted at exactly that.
[https://wiki.openjdk.java.net/display/zgc/Main](https://wiki.openjdk.java.net/display/zgc/Main)
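
If you want to try it, it's opt-in via a flag (and on JDK 11-14 it's still
gated behind the experimental-options flag):

    # minimal sketch: enabling ZGC on JDK 11-14 (heap size is a placeholder)
    java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xmx8g -jar app.jar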

------
lchengify
I couldn't be more excited about this! On a Kafka project we finished about a
year ago, we had so many Zookeeper issues in production and test environments
that the running joke was to check Zookeeper first if anything went wrong.

Honestly, Zookeeper is a great idea in theory: having a centralized service
for maintaining config info saves a lot of heartache when dealing with an open
source distributed systems project. But in practice, I've never had a smooth
experience getting Zookeeper to run consistently, especially with Kafka.

For sure, part of it is that we treat it like oxygen, in that if it's gone for
a few seconds everything just dies. But having dealt with similar systems both
proprietary and open source, my opinion is that Zookeeper just hasn't risen to
the challenges of its users in the past 5 years. If the next generation of
software architects want to use open source streaming or distributed systems,
Zookeeper needs to be rewritten or removed.

Also shout out to the confluent.io team: I never paid for your enterprise
license, but without your blog posts, docker images, or slack room, I never
would have been able to get Kafka working. Thanks again!

~~~
skyde
I am curious to know why people expect a Raft library to be more reliable
embedded inside the Kafka controller versus running inside a service like
etcd.

In the end, the broker will do RPCs to a service (the Kafka controller /
etcd), and this service will use Raft to replicate the state.

It should be exactly the same. And if anything, knowing which nodes are
running the Raft algorithm helps you be more careful with rolling restarts and
upgrades.

~~~
gen220
It might sound kind of trite, but it has less to do with the replication
algorithm and more to do with the fact that Zookeeper's source is quite
complicated vs., say, etcd's, so there are more opportunities for subtle bugs
to appear. I encourage you to look at the bug section of the changelog for
Zookeeper, and also the feature list of ZK vs. etcd.

Also, etcd powers many critical open source projects, so there are many
institutional eyes actively contributing to its improvement. IME, whenever we
encountered an issue at work with ZK, we found it impossible to trace it to a
bug that we could fix and upstream. etcd's been easier in this regard.

~~~
lima
Agreed, etcd is rock solid.

With regards to Kafka, it's probably easier and more robust to add their own
consensus layer rather than switching to etcd - Kafka is already a distributed
system built by a team of distributed systems engineers. It makes sense for
them to build their own consensus, deeply integrated with the replication
mechanism, rather than relying on an external database.

------
nealabq
I expect Pulsar to stay with Zookeeper. Kafka currently stores topic and
partition info on ZK (
[https://cwiki.apache.org/confluence/display/KAFKA/Kafka+data...](https://cwiki.apache.org/confluence/display/KAFKA/Kafka+data+structures+in+Zookeeper)
), which can get to be a lot of data. But I think Pulsar only stores server
names and basic config info on ZK (
[https://pulsar.apache.org/docs/en/administration-zk-bk/](https://pulsar.apache.org/docs/en/administration-zk-bk/)
), which is much more manageable.
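
For a sense of scale, the first link documents the layout; a partial listing
of the znodes Kafka maintains:

    /brokers/ids/<broker-id>                      # ephemeral broker registrations
    /brokers/topics/<topic>                       # partition-to-replica assignments
    /brokers/topics/<topic>/partitions/<p>/state  # leader and ISR per partition
    /controller                                   # current controller
    /config/topics/<topic>                        # per-topic config overrides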

~~~
colin_mccabe
_But I think Pulsar only stores server names and basic config info on ZK
([https://pulsar.apache.org/docs/en/administration-zk-bk/](https://pulsar.apache.org/docs/en/administration-zk-bk/) ), which is much
more manageable._

Unfortunately, this is not correct. BookKeeper stores a lot of information in
ZooKeeper. By extension, Pulsar (which is based on BookKeeper) also stores a
lot of metadata there.

For example, from the BK documentation (
[https://zookeeper.apache.org/doc/r3.3.6/bookkeeperOverview.h...](https://zookeeper.apache.org/doc/r3.3.6/bookkeeperOverview.html)
):

 _An application first creates a ledger before writing to bookies through a
local BookKeeper client instance. Upon creating a ledger, a BookKeeper client
writes metadata about the ledger to ZooKeeper. Each ledger currently has a
single writer. This writer has to execute a close ledger operation before any
other client can read from it. If the writer of a ledger does not close a
ledger properly because, for example, it has crashed before having the
opportunity of closing the ledger, then the next client that tries to open a
ledger executes a procedure to recover it. As closing a ledger consists
essentially of writing the last entry written to a ledger to ZooKeeper, the
recovery procedure simply finds the last entry written correctly and writes it
to ZooKeeper._
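
As a rough sketch of that lifecycle against the BookKeeper client API (the
ZooKeeper connect string and payloads are made up for illustration):

    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.BookKeeper.DigestType;
    import org.apache.bookkeeper.client.LedgerHandle;

    public class LedgerSketch {
        public static void main(String[] args) throws Exception {
            // The client talks to ZooKeeper for all ledger metadata.
            BookKeeper bk = new BookKeeper("zk1:2181");

            // Creating a ledger writes its metadata to ZooKeeper.
            LedgerHandle ledger = bk.createLedger(DigestType.CRC32, "secret".getBytes());

            // Entries themselves go to the bookies, not to ZooKeeper.
            ledger.addEntry("hello".getBytes());

            // Closing writes the last entry id to ZooKeeper, which is what
            // the recovery procedure reconstructs after a writer crash.
            ledger.close();
            bk.close();
        }
    }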

------
hugofirth
I never know where I sit on stuff like this. On the one hand, if you're
Confluent, I think it makes total sense to own this part of your
infrastructure, especially if it lets you improve your operability story.

On the other hand, I feel like projects should try to use open source
“building blocks”, like etcd and Zookeeper, when building their distributed
systems. Not only does this help iron out correctness bugs, but it also means
that more people are familiar with the quirks, limitations, requirements,
etc. of these tools. For example, I think I would be frustrated to hear that
K8s was implementing its own Raft.

~~~
sagichmal
Counterpoint, if I need to deploy technology T to solve problem P, I don't
want to have to also deploy a flotilla of support technologies because
modularity or whatever. ZooKeeper was always an implementation detail of
Kafka, the fact that you had to manage it separately was an abstraction leak
that I'm happy to see fixed.

~~~
saurik
I feel like this just argues that ZooKeeper should be a library that can be
embedded in the servers of other projects (even if it is conceptually a
separate server in the same memory space listening on its own ports; but like,
it is otherwise entirely hidden inside of the other program and is entirely
managed and configured by it also).
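
Something close to this already exists for tests: Curator's TestingServer
runs a whole ZooKeeper server in-process, owned entirely by the host program.
A minimal sketch (the class name is mine):

    import org.apache.curator.test.TestingServer;

    public class EmbeddedZk {
        public static void main(String[] args) throws Exception {
            // Spins up an in-process ZooKeeper server on a free port; the
            // host program fully controls its lifecycle and configuration.
            try (TestingServer zk = new TestingServer()) {
                System.out.println("embedded ZK at " + zk.getConnectString());
                // ... hand the connect string to whatever needs coordination ...
            }
        }
    }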

~~~
skyde
100% agree, we need a good etcd/zookeeper as a library.

I wrote something similar, but it provides only locks/leases, using Paxos.

The problem with the replicated state machine design of Raft, etcd, and
Zookeeper is that:

- machines leaving and joining the ensemble dynamically is hard to do
correctly

- the optimal quorum size is no more than 5; you can set up other nodes as
observers, but it's hard to decide which of 1000 nodes should be in the
quorum...

~~~
lima
The etcd server can be embedded in other projects, and their Raft library is
used by plenty of other projects like CockroachDB.

~~~
skyde
Yes, it can, but it's a real pain compared to the HashiCorp version. I really
hope they clean up the library to have a simple API.

This way it could become the standard for Go, and we could stop wasting effort
on multiple Raft implementations.

------
fourseventy
This is cool. One thing I love about Cassandra is that you don't need
Zookeeper; it's just a bunch of homogeneous nodes.

~~~
mavelikara
Same with Elasticsearch.

------
kungfufrog
When will this actually be released and available for use?

I'd love to ditch Zookeeper (the weak link that falls over most regularly) in
our current Kafka cluster.

------
eatonphil
Here's the proposal in full.

[https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A...](https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum)

------
KenanSulayman
Finally! Having to properly set up Zookeeper on multiple Mesos clusters was so
painful, and recovering from outages via S3 backups so excruciating... I feel
somewhat ashamed that in two out of five cases we simply went with the managed
Kafka from AWS because the team didn't feel confident enough to maintain the
cluster.

~~~
gen220
Honestly, this is probably the right thing to do. My cynical conspiracy theory
is that the ZK dep was organized by cloud vendors to keep customers coming :)

It's been at least a year and a half since we've had severe data loss on our
homebrew Kafka cluster, but after the first couple of incidents you never look
at Kafka+ZK the same way... both, IIRC, were due to leader election bugs that
had been reported many months earlier with no progress on a solution.

I have no idea how AWS internally puts up with this. I wouldn’t be surprised
if they’d replaced the ZK dependency internally years ago.

------
xorchasiv
I might be out of my depth here, but would it be more efficient for the Kafka
brokers to use a gossip protocol to distribute the metadata?

~~~
jdean677
A gossip protocol is useful for propagating node failures or other events that
are fine with being eventually consistent.

It cannot be used for leader election, as those events are time-sensitive and
need to be consistent across the cluster within a short window; that is the
reason we have Raft and Paxos.

~~~
xorchasiv
Ah I see, thank you for the clarification

------
devin
Amazing that people chose to operationalize Kafka with this completely
unnecessary dependency. Good riddance!

------
ironfootnz
Well... it’s just Kafssandra now.

