ZooKeeper vs. Doozer vs. Etcd (devo.ps)
138 points by hunvreus on Sept 11, 2013 | 90 comments

The story of Doozer is a classic example of how not to steward an open source project.

It was released by two Heroku engineers who promptly and completely abandoned it. By completely I mean they did not respond to any communication whatsoever for a year or so, despite a very active community of users and forks that had sprung up around the project. I don't begrudge them their lives (or whatever drew them away), and there were people/companies willing and able to take over maintenance, but not even that happened. It probably would have taken just a few hours to hand over maintainership. Instead, just radio silence.

Eventually there was some movement and the most active group of fork maintainers were given commit access, but by that time any enthusiasm over Doozer was long dead and gone.

Not officially a Heroku project AFAIK, but Induction is essentially the same story: https://github.com/Induction/Induction

Open source polyglot database viewer built by Heroku guys, released with a lot of hype, and then radio silence.

I agree. It's a pity Doozer got to this point. I was happy to see that as recently as this February proactive developers on the Google group (https://groups.google.com/forum/#!topic/doozer/fVcS0y3KuHQ) tried to get the project active and coordinated, but I guess merging different codebases from different forks was too big of a challenge.

What it really needs is a company like this to get behind it and steward it.

One point the article didn't cover is clients. Making a good Zookeeper client is hard, for two reasons:

1. The protocol is difficult to implement. In theory you could just use Jute to codegen this part, but that assumes Jute supports the language you need. Doozer improves on this with a simple text-based protocol, and etcd goes a little further with an HTTP API.

2. The primitives Zookeeper exposes are very primitive. Implementing higher level abstractions such as locking or leader election on top of znodes is easy to get wrong. Just this year Curator has fixed critical bugs in both of those algorithms: https://github.com/Netflix/curator/blob/master/CHANGES.txt

My feeling is that etcd and doozer fare a little better on #2 just because their primitives are slightly easier to understand, but fundamentally the problem still exists. I'm looking forward to seeing more innovation in this area.

I agree. At FoundationDB, we're writing a coordination tool (working name "beastmaster") that provides service discovery, locking, leader election, etc on top of our transactional key/value store. It is higher level than these tools; our idea is to try to make it useful from a command line or DNS rather than have to be baked into every piece of software that needs service discovery.

    beast service --lock name=mailserver.foundationdb.com --run mailserver.sh &
    beast service name=webserver.foundationdb.com --run webserver.sh &
    ping mailserver.foundationdb.com
    wget http://webserver.foundationdb.com/

That is cool! How are you doing locking for transactions right now internally?

You can use etcd from the command line with etcdctl or grab environment variables for your process using etcdenv.

I would love to see someone back a DNS server with etcd too.

etcdctl: https://github.com/coreos/etcdctl

etcdenv: https://github.com/mattn/etcdenv

FoundationDB (https://foundationdb.com) provides optimistic transactions-- the client does reads from a consistent snapshot, and then tells the database what it wants to write and what serializable reads it did, and the database figures out whether the transaction can safely commit or whether it needs to be retried.
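That commit check can be sketched in toy form. This is not FoundationDB's actual client API, just an illustration of the optimistic scheme described: the store versions every key, a transaction records which versions it read, and commit fails (caller retries) if any of those keys changed in the meantime.

```python
# Toy sketch of optimistic concurrency control as described above.
# All class and method names are made up for illustration.

class Store:
    def __init__(self):
        self.data = {}      # key -> value
        self.version = {}   # key -> version counter

    def begin(self):
        return Transaction(self)

class Transaction:
    def __init__(self, store):
        self.store = store
        self.reads = {}   # key -> version observed at read time
        self.writes = {}  # key -> pending new value

    def read(self, key):
        self.reads[key] = self.store.version.get(key, 0)
        return self.writes.get(key, self.store.data.get(key))

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Conflict check: every key we read must still be unchanged.
        for key, seen in self.reads.items():
            if self.store.version.get(key, 0) != seen:
                return False  # conflict -- caller should retry
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.version[key] = self.store.version.get(key, 0) + 1
        return True
```

Two transactions that read the same key and both try to write it illustrate the retry case: the first commit succeeds, the second sees a version bump and fails.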

Beastmaster is just a client that sits on top of FoundationDB's transactions and fault tolerance, and adds coordination-specific things like a global clock, fair locks with timeouts, and a data model for service discovery. And then provides useful command line tools and a simple DNS and REST server. We should have it on github soon.

That's how our Chubby clone works... it uses the Route53 API to publish service endpoints and leaders.

I'm also in the club of people building one of these things (mine is on top of LevelDB and written in ANSI C with a lock-free design).

My most recent blog post is on getting the thing to bootstrap: http://www.bringhurst.org/2013/09/09/consensus-quorum-bootst...

Ours is on top of LevelDB too but "lock free" in Haskell using the STM.

Is this open source?

Not yet... it'll land here when it's ready https://github.com/alphaHeavy

one other thing to keep in mind is that the underlying algorithms (zab and raft) provide different guarantees.

for example, zookeeper/zab allows reading directly from followers with a guarantee to get at least a past value that won't be rolled back. this was one reason zookeeper didn't use paxos:


in my understanding, raft doesn't allow reading directly from followers, because the follower logs may get repaired/rolled-back when a new leader is elected. (though i'm sure an implementation can tweak the protocol to provide this support.)

that said, raft has a lot of interesting applications, and, in my opinion, is definitely more understandable than the many versions of paxos. (implementing zab yourself, at this point, would be a futile exercise.)

i found the videos from the raft user study to be very well done (and easier to understand than even their paper):

raft: http://www.youtube.com/watch?v=JEpsBg0AO6o

paxos: http://www.youtube.com/watch?v=YbZ3zDzDnrw

...however, i think they did paxos a disadvantage by not just focusing on multi-paxos (which is probably the most common implementation). but, it's certainly fair to say that info about paxos is spread out far and wide...with perhaps too many knobs to turn and implementation-related details to fill in yourself.

as a side note: i've just started implementing raft in a set of libraries (multiple languages) that will be open source - along with other protocols.

One thing is not correct: Raft allows reading from followers. You will never be able to read an uncommitted value in Raft.

I'm the original author of go-raft (the implementation used in etcd) so I'll try to address some of the points in this thread.

Raft only updates the state of the system once log entries are committed to a quorum, so the local state will never be rolled back. Log entries can be thrown out, but they haven't been committed to the local state, so it doesn't matter.

You can read from the leader if you need to ensure linearizability but that will kill your read scaling. Another approach is to read locally and check that the local raft node isn't in a "candidate" state (which would mean that it hasn't received a heartbeat from the master within the last 150ms). That approach works for a lot of cases.
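That local-read rule can be sketched as a small guard. The 150ms figure and state names follow the comment above; the function itself is illustrative, not go-raft's actual code.

```python
import time

HEARTBEAT_TIMEOUT = 0.150  # 150 ms, the heartbeat window mentioned above

def can_serve_local_read(state, last_heartbeat, now=None):
    """Hypothetical helper: serve a read from this node only if it is
    not a candidate and has heard from the leader recently. A node in
    the 'candidate' state may be partitioned and arbitrarily stale."""
    now = time.monotonic() if now is None else now
    if state == "candidate":
        return False
    return (now - last_heartbeat) <= HEARTBEAT_TIMEOUT
```

This gives you cheap local reads at the cost of strict linearizability; reads that must be linearizable still go to the leader.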

As far as implementing multi-paxos, the authors behind Google Chubby have talked about how there is a large divide between theoretical multi-paxos and actually implementing multi-paxos. Also, there aren't any standalone multi-paxos Go libraries available. I wrote go-raft because there wasn't an alternative distributed consensus library in Go at the time.

Let me know if you need an extra pair of eyes on your Raft implementation or if you have any questions (ben@skylandlabs.com).

> raft doesn't allow reading directly from followers,

In raft all client connections to followers are redirected to the current master.

> i think they did paxos a disadvantage by not just focusing on multi-paxos

Raft is equivalent to (multi-)Paxos.

Having followers redirect to the leader is how the paper describes the algorithm.

But I don't think there is anything stopping you from having followers service committed log entry reads, provided you're willing to live with being out-of-date.

i don't think this is the case. here is a link to a video by one of the authors about log repair during leader election:


log entries on a follower may get rolled back - and thrown out - since they were not accepted on a majority of followers.
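Both views are consistent once you separate replication from commitment: an entry only becomes committed (and visible to reads) once a majority of servers hold it. A toy sketch of how a leader would compute the commit point from its followers' match indexes (this ignores Raft's additional rule that the entry must be from the leader's current term):

```python
def commit_index(match_indexes):
    """Highest log index replicated on a majority of servers.
    Sorting descending, the entry at position n // 2 is held by
    (n // 2) + 1 servers -- a majority. Entries above this point are
    the ones that may still be rolled back during leader election."""
    n = len(match_indexes)
    return sorted(match_indexes, reverse=True)[n // 2]
```

For example, with five servers whose logs reach indexes 5, 4, 3, 2, 1, only entries up to index 3 are on a majority and thus committed; entries 4 and 5 could still be discarded.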

Does this mean that there is no read scaling?

The master server must be able to handle the entire read load?

Another issue with zookeeper: the C bindings, which are the basis for a lot of language bindings, are both buggy and often out of date, especially with respect to individual language bindings. I've often seen internal CPU-spiked loads on the C lib's internal IO thread, i.e. pathological behavior. The recent support for dynamic quorums or transactions has yet to make it to any of the distributed language bindings. At least for Python, the folks at Mozilla put together a nice from-scratch library in the form of kazoo, at the cost of reimplementing the entire protocol. Compare that to the simplicity of etcd, where you can just curl/wget a request.
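To illustrate that simplicity, a key lookup is just an HTTP GET returning a small JSON document. The shape below is from memory of etcd's early HTTP API (port 4001, /v1/keys) and the exact field names may differ between versions; treat it as a sketch.

```python
import json

def parse_etcd_value(body):
    # Pull the value out of an etcd key-lookup response. Field names
    # are from memory of etcd's early HTTP API and may differ between
    # versions -- illustrative only.
    return json.loads(body)["value"]

# Roughly the shape of a response from something like:
#   curl http://127.0.0.1:4001/v1/keys/message
body = '{"action":"GET","key":"/message","value":"Hello world","index":5}'
```

No codegen, no binary protocol, no C bindings: any language with an HTTP client and a JSON parser is a client.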

re #1, as I said elsewhere, at youtube we use zkocc to proxy readonly connections to zookeeper, and zkocc supports e.g. bson-over-http (and in any case it would be easier to add new client protocol support to zkocc than to zookeeper, I think. maybe I just think that because I don't want to java.)

You work for youtube and use zookeeper?

They make some cogent points about Zookeeper, but is javaphobia really a valid concern here? Yes, you have to install a JVM, and Oracle doesn't make that as friendly as it could be. But in my experience ZK doesn't bring along "a ton of dependencies". Likewise, I'm skeptical that the performance of Java v. Go in this case makes a huge difference: you're only spinning up the JVM once, at startup.

Maybe I'm too technically conservative, but it seems like the advantages of Zookeeper's maturity and support base outweigh the ickiness of Java and the whizbang factor of Go. Complaining about the politics of the ASF doesn't really factor in; they're still much more likely to be around in 10 months.

As we said, we don't hate Java. We were however constrained in what we could install on the host machines since we administrate these on behalf of others. Even lightweight dependencies were a no-go for us.

With regards to Apache, we have "mixed feelings". We're aligned with a lot of what they stand for and are definitely very thankful for what they enabled in the OSS community, but are not enthused with the way they handle the projects that join them.

I get the mixed Apache feelings, but I still don't quite get the dependencies complaint. It seems like your clients have heterogeneous environments (multiple Linux distros, at least). Do you distribute a statically linked binary, or do they compile it on their box? It seems like in this case Go code is actually way more complex to set up.

We actually deal with a limited number of distros for now (mostly Ubuntu and RHEL) and we distribute statically linked binaries (meaning no dependencies at all).

Even if you only look at amount of RAM used, with Zookeeper it'll never be lower than 35M and most likely be much higher than 50M. With etcd, you can get away with significantly less than that.

The dependencies and JVM boot time are also annoying if you don't already use Java, though.

Are you guys already using etcd in production? I've been using it for a project lately, and I thought it would have low memory usage but it doesn't.

I'm making about ~80 writes/second with only ~125 keys and all 3 of my etcd nodes sit at 200M+ resident memory.

Is 80 writes per second super huge traffic, beyond the realm of what this is designed to handle? I would expect for etcd to be used for things like which machines are responsible for doing what in a cluster, and for the values to change relatively infrequently during events such as adding a new machine to the cluster. Perhaps for a system running 80 requests per second, using etcd to locate a Redis server would be better. Would the memory use be more reasonable then, if it weren't replicating such a huge volume of log transactions?

The readme claims that etcd can perform up to 1,000 w/s, and I don't think 80/s is particularly a lot.

At the end of the day there are only ~125 keys (most of which aren't among the 80; they expire really quickly), and I'm only writing integers (maybe 8 bytes in length converted to strings).

However, I don't know if this is caused by my use case or if it's just a normal day for etcd. I haven't been able to confirm with anyone else what their memory patterns look like. While running my etcd cluster over the last couple of days my log file has grown 3x to 92M (expected), but my memory usage hasn't grown that much. It now sits at 295M, so I'm unconvinced it actually has anything to do with the way I'm using it.

I opened an issue here, https://github.com/coreos/etcd/issues/162, and it seems the maintainer confirmed it.

Even though Zookeeper is "mature" it is definitely a beast to fine tune and wade through. Zookeeper definitely does not bring along a ton of dependencies. If you're deploying it as documented it should be running on its own dedicated machines. I don't see why dependencies are a problem there especially with Chef/Puppet nowadays.

Exactly. A "devops" company shouldn't really be complaining about dependencies anyways, they're a fact of life. I admit my experience with ZK has been 50% administering it with Cloudera Manager, which basically abstracts away all the nastiness. I'm curious about specific issues people have had though; I've used ZK with Hadoop and Kafka without any major issues.

Why shouldn't a 'devops' company complain about dependencies? It's their domain completely. Treating anything as a "fact of life" is not going to make the problem go away... And what about anyone out there trying to use them who isn't a 'devops' thingy? A painful dependency graph is going to be something that influences their decision on what tool they choose.

Pretty much what I thought. The leaner the better.

Moreover, in our specific case, we did not want to introduce dependencies on hosts that are managed by our customers so as to not run into conflicts with their own stack.

Especially when the standard way to deploy it is likely to unarchive a tar and use the JARs needed directly from the project's lib folder rather than a system-wide/package-manager installed JAR.

Just because you have tools to ease dealing with dependencies, it does not mean that those dependencies and the cost of dealing with them have gone away. Easier to not have the problem than to solve it...

What exactly are the problems with it? I am behind a strict corporate network and all I needed to do was download the Zookeeper JAR and write a couple of deployment scripts. Even though the environment variables for the startup scripts are lacking, they cover nearly everything you could want.

Zookeeper has legitimate issues, and from experience most of them stem from the documentation being so verbose that it takes a lot of fine tuning to get right. But if you put it on its own hardware, or at the very least put the write-ahead log on its own partition, you should be good to go.

It's not javaphobia. I think the point is that Go produces a statically-linked native executable, while VM-based languages like Java add many dependencies and also reduce the memory left for the application itself (assuming z-nodes run on the same nodes as the application).

You can run a high performance Java process with a small heap, I would call this a non-issue. I have many heavily-used services with 32-64mb of heap space.

Certainly, Go will end up using less memory but in the day and age of small servers with 1.5GB of RAM, it's really a non-issue.

Right, but bear in mind that their customers will see that ram usage as "wasted" RAM.

I have a product with an agent that runs around 80MB of RAM usage on giant servers, servers with 500+GB of RAM, and my customers still complain sometimes.

DIY is the pink elephant in the room.

Every company in the world that isn't specifically a software-oriented tech company uses some form of DIY model. Actually, strike that, even they use a DIY model. These tools are the proof!

Look at the origins for every modern open source management framework or tool, and it was just a DIY tool that some startup-turned-huge-company developed out of their own needs, then cleaned up a lot and released to the world. Your needs may not match those of the company who developed the tool, so it may not work for you. But I guarantee you that no tool will work for every situation.

Pick the tool that best fits your needs and then fork it and maintain it internally. You'll be doing it anyway. (Unless you don't hire software developers, in which case you'll want to pay for a real product with a support contract.) Once you've done that, stop writing navel-gazing blog posts about your infrastructure that won't apply to 99% of us.

I think it's more like a white elephant. I totally understand the motivation to write and blog about these vanity infrastructure software projects, but I agree it is hard to stay enthused about reading about another one in the language du jour.

I use postgresql + listen/notify to share and push configuration out to applications.

Each application starts a thread that LISTEN's for NOTIFY's from postgresql.

I have a settings table (name text, value text). Configuration data is stored there.

    insert into settings (name, value) values
      ('sites.my-site.authnet.api_key', 'asdfasdf');
There's a trigger on that table that issues a NOTIFY to all the clients. When the clients receive the NOTIFY, they query the table and store all of the settings in memory.
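A sketch of such a trigger (all names here are illustrative, not the poster's actual schema): a statement-level trigger fires a NOTIFY whenever the settings table changes, and clients that have run LISTEN re-read the table on each notification.

```sql
CREATE FUNCTION settings_notify() RETURNS trigger AS $$
BEGIN
  NOTIFY settings_changed;
  RETURN NULL;  -- return value is ignored for AFTER ... STATEMENT triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER settings_changed_trigger
  AFTER INSERT OR UPDATE OR DELETE ON settings
  FOR EACH STATEMENT EXECUTE PROCEDURE settings_notify();
```

Clients subscribe with `LISTEN settings_changed;` on a dedicated connection.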

It works great.

No additional moving parts. All my configuration is stored in the database, they aren't checked into files that need to be protected. No security concerns about api keys stored in a separate service. My settings are backed up along with the rest of my data.

If you already have to maintain a database and don't need lease/lock management or simple failover... broadcasting configuration values is pretty simple, sure.

I chose Zookeeper to achieve this same goal a while ago (before I heard of etcd). I have been pleasantly surprised at how useful having a coordination service is in my infrastructure in addition to a configuration management service. Because of this, even though etcd looks like it serves distributed configuration management better (aka simpler), I'm happy with my choice of Zookeeper.

Two examples of how a coordination service has been useful:

* cluster wide throttles to help protect overwhelmable backends

* redundancy in maintenance cronjobs that really only want to be run once per cluster per time period

(edited for formatting)

How do you power cron from ZK? Do you use something like airbnb's chronos[1]?

[1] http://nerds.airbnb.com/introducing-chronos

No, I took a different approach. Any normal cron job that wants to be run only once (per cluster) grabs a zookeeper lock. Those that fail to acquire the lock exit and try again during the next interval. This is implemented with a wrapper around the cron job that takes care of locking: https://github.com/ParsePlatform/Ops/blob/master/tools/get_z.... More details here: http://blog.parse.com/2013/03/11/implementing-failover-for-r...
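The wrapper pattern is roughly this sketch. A local flock stands in here for the ZooKeeper lock (which the real wrapper takes via the linked tool); the function and file names are made up.

```python
import fcntl
import subprocess
import sys

def run_exclusively(lockfile, cmd):
    """Try to grab a lock; run the cron job only if we got it,
    otherwise exit and let the next cron interval retry. A local
    flock stands in for the ZooKeeper lock a real wrapper would take
    (e.g. via kazoo's Lock recipe)."""
    f = open(lockfile, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return None  # another holder has the lock; bail out quietly
    try:
        return subprocess.run(cmd).returncode
    finally:
        fcntl.flock(f, fcntl.LOCK_UN)
        f.close()
```

The key property is the non-blocking acquire: losers exit immediately instead of queueing, so at most one copy of the job runs per interval.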

You can add YouTube to the list of big players that uses ZooKeeper. We use zkocc [0] to scale (readonly) clients.

[0] http://godoc.org/code.google.com/p/vitess/go/zk/zkocc

Thanks, didn't know that. Yep, YouTube definitely counts as a big player. Added you to the list.

Sorry if this is a dumb question but I've never understood the purpose of these configuration stores. If you're not running in the cloud, but have servers in a datacenter that all mount an NFS share, is there any benefit of these over simply reading a json/yaml file off of an NFS mount?

Resiliency to your NFS server falling over is one. Sane atomic updates and locking are another.

HA NFS is not hard. Using DRBD to replicate the underlying block device, or Gluster to replicate the underlying filesystem, plus IP takeover via e.g. keepalived, is fairly trivial to set up. Running NFS on a single server is a bit like running one of these configuration management systems on a single server - they won't be resilient then either. Using Gluster directly is another easy way of getting resiliency.

Atomic updates and locking is another matter, but for a lot of setups it's simply not needed.

Or you can run a single tool that is designed for this one job.

Gluster in particular isn't a panacea for resiliency, you've got to really know where it departs from POSIX to not create problems for yourself.

Are the GlusterFS people mistaken?


"GlusterFS is fully POSIX compliant."

Ooh, news to me. That certainly wasn't the case 2 years ago.

Atomic updates on an NFS share can be achieved with renaming changed files into place, no?
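The rename trick usually looks something like this sketch (whether the rename is actually atomic on NFS depends on the server implementation; on a local POSIX filesystem it is):

```python
import os
import tempfile

def write_atomically(path, data):
    """Write to a temp file in the same directory, then rename it into
    place. Readers see either the old contents or the new, never a
    half-written file."""
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit disk first
        os.rename(tmp, path)  # atomic replacement on POSIX filesystems
    except BaseException:
        os.unlink(tmp)
        raise
```

The temp file must live in the same directory as the target, since rename is only atomic within a filesystem.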

It sounds like you're expecting NFS to be a reliable, proven, high-performance technology. It's anything but. I've never encountered an implementation without horrible bugs, most commonly involving locking and atomicity issues. Things get oh-so-much worse if you have to mix implementations, and performance is always bad.

I would never consider NFS an option in any production system.

On a more general note, anything you build that relies on the particular semantics of any sort of traditional filesystem is sure to be wrong, either now or in the future when it needs to be run on a different filesystem. This is an area of software engineering that's a serious pain in the ass. Avoid it whenever you can.

Or if you are running in the cloud, simply reading a json/yaml file off S3?

I think there are a few key benefits to a centralized manager over the distributed file listener.

1. Coordination between services on reconf.

2. A consistent implementation of what constitutes a change, through the API on the central server.

By making it API based you can hook into these updates and cause reconfs and coordinated responses (like rolling bounces) through that system as opposed to each system polling the file and hoping that the order comes out in the wash.

You could, with enough work, make it so that a client was aware of how to handle individual diffs from the file and coordinates through the file, but at that point you're now distributing common parsing logic across multiple systems (and potentially implementations) rather than a central system, which sounds awfully un-DRY.

Leader election in addition to what else has been said

Leader election - isn't that just a problem created by using an external cluster of servers for storing config? If you were just reading a file, there's no issue of selecting a master from your NFS share.

The two rationales I've heard so far are A. a cluster of one of these config servers will provide better uptime than just an NFS share and B. you can use one of these things as a distributed lock server.

Hey I recently developed a service devoted to nothing but storing configuration files! It seems like I'm kind of in the same space (barely), I'd appreciate some feedback:


If infrastructure (eg system) configuration is the goal, there's already a much more accessible project with backend support (flexibility to hack up support for ldap, zk, etc):


Intro: http://www.devco.net/archives/2011/06/05/hiera_a_pluggable_h...

Note: Does not depend on puppet, so it'll work with chef. A hiera-databag adapter would make sense.

This article is really bad. The arguments against ZK are completely subjective. The pros for Doozer are "it works" which is hardly an evaluation. The features in the pros for etcd are all present in ZK.

About a year ago I deployed Zookeeper to support the Redis failover gem. All of the problems were a result of my lack of knowledge on how to deploy and configure it. My guess is that it's mostly used right now in Hadoop/Solr installations and is configured specifically for those.

I've been using Doozer at home on a few side projects and, mainly for it being written in Go, have enjoyed using it a lot more. The point on security is spot on. After reading this post I am definitely going to take a look at etcd.

I kind of can't believe they used the Apache Foundation as a negative bullet point for Zookeeper. Come on, just say you felt like writing your own thing.

They're not the only ones that have gotten a negative impression of the Apache Foundation. Other than the web server, the Apache Foundation seems to largely be a burial ground for open sourced corporate Java projects.

It may very well be they don't deserve that, but if so they really need to improve their pr.

What? They didn't write their own thing.

"and is pretty fragmented (150 forks…)"; I don't think they understand how GitHub works... (forks here do not equal "forks" of a project in the traditional sense)

But in this case I think they're correct. If the project is abandoned and forks don't ever get merged back in, development is fragmented and you end up having to pick and choose commits from various forks.

Or just push your config on S3 ?

Easy to use and setup, reliable, good doc, supports ACL.

The only downside is that it's not broadcasting changes.

People use services like Doozer, Zookeeper and Etcd because of consistency+availability guarantees. S3 will give you the availability, but not consistency. If you can sacrifice consistency, there are tons of trivial ways of doing this (in addition to your suggestion of S3):

* DNS with slaves

* LDAP with slaves

* Rsync of a directory of config files

* Pair of web servers or NFS servers or Samba servers using either rsync or a redundant network filesystem (e.g. GlusterFS) or block device (e.g. DRBD) to back it.

Solving that problem is easy. It's once you need/want the consistency guarantees that things get dicey.

There's also http://arakoon.org/

DNS on its own can be enough in some scenarios

Care to explain how DNS alone can do distributed config management?

Spotify store some configuration in DNS to good effect:


It's obviously no Zookeeper but it is proven and mature.

That's interesting use of dns indeed.

They also mention the DNS-for-service-discovery approach starts to reach its limits, and Spotify is considering Zookeeper (quote):

   We have not yet (as of January 2013) started implementing a replacement. We are 
   looking into using Zookeeper as an authoritative source for a static and dynamic 
   service registry, likely with a DNS facade.

I find their reasons curious. Why in the world are they using zone files? There are tons of DNS servers that support database backends, and writing tools to interact with them is easy.

For that matter, writing an authoritative-only DNS backend is easy (been there, done that - took about one week from starting to read the RFCs until having a production-ready backend; it takes little time because most/all of the hard work is in the recursive resolvers, and the DNS protocol is actually very well described in the RFCs).

And claiming DNS provides a static view of the world is a bit funny - DNS provides TTL values for everything. If you want a dynamic view, you specify low TTLs, and make sure your clients honour them, and couple that with fast replication of the zone data. There's plenty of options for that, from the duct tape (my DNS server could update however many records you could write to disk on your hardware per second - via a small script that used Qmail as a queueing messaging server...) to well polished products.

Couple that with NOTIFY and IXFR, the protocol provides every mechanism necessary for keeping zone data replicated and up to date. Many modern DNS servers also let you instead simply rely on database replication (e.g. you can have the DNS server serve data out of Postgres for example, and use Postgres replication to keep the zones up to date over multiple servers), or leave it to you to do updates.

The appeal with DNS here is the long track record and existence of servers that have been battered to death in far more hostile environments than most internal service discovery systems ever will need to deal with.

The downside to DNS is that to get things like guaranteed consistency, you'd need a backend that can guarantee it, and clients that don't cache (which means you need to be careful about what resolvers you rely on). And then it might be just as easy to just deploy one of the options in this article (but there's nothing inherent with DNS that prevents that either).

Oh! Treat records as key-value pairs, and encode the value into the IP. If your timeout is 10 seconds, add a DNS record timeout.server.yourdomain which resolves to an IPv6 address with value 10. It gets tougher with ASCII strings, but you could support multi-record configs as well. Then your application just uses nslookup to download the config when you reload it.
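A toy sketch of that encoding (the fd00::/8 prefix and both helper names are made up for illustration):

```python
import ipaddress

# Stuff an integer config value into the low bits of an IPv6 address
# under a reserved prefix (fd00::/8 here -- an arbitrary choice),
# and read it back out. So timeout.server.yourdomain would get an
# AAAA record of fd00::a to carry the value 10.

PREFIX = int(ipaddress.IPv6Address("fd00::"))

def encode(value):
    return str(ipaddress.IPv6Address(PREFIX | value))

def decode(addr):
    # Mask off the prefix, keeping the low 64 bits as the payload.
    return int(ipaddress.IPv6Address(addr)) & ((1 << 64) - 1)
```

The application then just resolves the record and decodes the address whenever it reloads its config, with DNS caching working in its favor.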

If someone builds this I will be their best friend

DNS may be good for lightweight service discovery; people have been doing it for ages. However, I won't waste any time trying to dress it up as an answer for real-world config management problems (distributed, hierarchical, model-agnostic, consistent... "stuff").

The only one on your checklist that DNS doesn't cover is consistency, and there's tons of applications where short term inconsistency is totally acceptable.

Well, at the very least there's a TXT record for strings. Access control and consistency may both be issues though.

Crazily enough, this is built into most DNS implementations. It's called Hesiod-class records: http://en.wikipedia.org/wiki/Hesiod_(name_service)

If you've ever wondered why DNS requires the "IN" (for "Internet") in all its record declarations, it's to make this distinction. The other two options are "HS" (for Hesiod), and "CH", for http://en.wikipedia.org/wiki/Chaosnet.

You can also use DNS like a distributed cache. This is useful when you have millions of clients because their local DNS server will do caching for you. You can also use it like a bloom filter where cache hits are true positives and anything that misses might or might not be a valid key.

That sounds like the opposite of a Bloom filter. In a Bloom filter, you can have false positives but you always get true negatives.

I meant to say "except", where I went on explaining the opposite kind of lookups.

That doesn't sound much at all like a bloom filter.

Look up how service discovery via SRV records works. That is how Active Directory (amongst other things) works under the covers.

To do what exactly?
