ClickHouse Keeper: A ZooKeeper alternative written in C++ (clickhouse.com)
218 points by eatonphil on Sept 27, 2023 | 126 comments



Coincidentally, as someone who worked on this blog, I was surprised (and pleased!) to see that we are not the only ones who felt the need to build a Zookeeper alternative.

Looks like folks at StreamNative did as well, with their Oxia project: https://github.com/streamnative/oxia. They were just talking about this yesterday at Confluent Current ("Introducing Oxia: A Scalable Zookeeper Alternative" was the title of their talk). https://streamnative.io/blog/introducing-oxia-scalable-metad...

Seems to be a trend :)


Looks to be a slightly different design goal for Oxia w.r.t. replication and fault tolerance. https://github.com/streamnative/oxia/blob/main/docs/design-g...


Def.

I hadn't seen Oxia before but the idea, for their implementation, of making Zookeeper more like Bookkeeper was an interesting one.

Not right for ClickHouse needs but, IMO, a novel approach.


Meta additionally has some internal software called Zelos, which is the ZooKeeper API but implemented on Delos [0], though you didn't hear it from me.

[0] https://www.usenix.org/system/files/osdi20-balakrishnan.pdf


Is the trend mainly due to ZK being written in Java?


It's not the written-in-Java part, it's the running-in-the-JVM part that's the issue. Memory-hungry is what I think of when I think of Java apps. I'd much rather have a low-memory Go or Rust service.


but if you use ZK for what it was designed for (basic cluster role/naming coordination, distributing configs to cluster nodes, and similar) then this kind of thing doesn't matter

I mean, in a typical use case you would

- run it on long-running nodes (e.g. not Lambda or spot instances)

- run more or less exactly 3 nodes even up to quite a large cluster size; I guess some use-cases which involve a lot of serverless might need more

- configs tend not to change "that" much, nor are they "that" big

what this means is

- Java needing time to run hot (JIT optimization) is not an issue for it

- GC isn't an issue

and if you look at how much memory (RAM) typical minimal cloud nodes have, then in the context of typical config sizes and cluster sizes, memory is also not an issue

though I guess depending on what you want to do, there could be issues if you use it

- for squeezing analytics or similar through it

- in setups with very, very large, constantly changing clusters, e.g. in some serverless context with a ton of ad-hoc spawned instances, maybe using WASM and certain snapshot tricks allowing insanely fast startup times

- when you want to bundle it directly into other applications running on the same hardware, and those applications need more memory

but all of these are cases it wasn't designed for, so I wouldn't call it an "alternative" but rather a ZooKeeper-like service for different use-cases, I guess


Long-running server processes are not only "not an issue" for the JVM, they're the main use case in which the JVM is superior to AOT compilation! Same is true for C# and the .NET CLR, by the way.

If you're running a Lambda function in which startup time is extremely important, or an embedded application where size and resources are paramount, or even just a short-lived process where you don't care either way... then AOT makes a lot of sense.

But for long-running server processes, just-in-time compilation almost always results in better performance than AOT compilation, which cannot optimize at runtime based on what's actually happening.

HN should be full of people who know better, but these discussions feel like piping information into /dev/null. Web devs, students and hobbyists, and other low-information voters just have it in their heads that AOT is always a superior model, and JIT always an inferior fallback, and there's nothing you can say to break through that. There aren't enough people from the business server side world who spend enough time in online discussion forums to correct the narrative.


I agree that a good JIT and the JVM is very powerful, but it's not a silver bullet that magically works for all systems. For example, JIT-compiled code can become uncompiled and recompiled all the time in the same process, resulting in weird performance characteristics. You can definitely fine tune and code in ways that prevent this, and for many apps it won't matter, but when it does matter it's a pain in the butt.


More on this: "Virtual Machine Warmup Blows Hot and Cold" https://arxiv.org/abs/1602.00602


I think there's an important distinction to be made: theoretically speaking, given that JIT compilers generally care much more about compilation speed than AOT ones do, it is not unreasonable to assume that AOT compilers ought to produce more optimized code.

With that said, this is not the case for either the JVM or .NET. Stepping from "ought to" to "is": both have JIT compilers which produce better-optimized code than their AOT counterparts, for a variety of reasons, including the R&D effort put into JIT throughout their history and JIT allowing the code to be dynamically profiled and recompiled according to its execution characteristics (.NET's Tier 1 PGO and HotSpot JVM's C2).


Theoretically speaking, JITs have access to strictly more information than AOT, so they ought to be better once you amortize out the timing issues. A really good profiling JIT will do a fast-compile pass the first time through, and then progressively re-compile with knowledge gained from profiling during runtime.


“Has more information, therefore better optimized” is bordering on a lie, is the problem.

There's truth to it up to a point, I guess, but it's also true that a lot of “more information” is useless, and sometimes harmful to optimization, and additionally that the returns diminish quickly.

Truth is, as long as you're not over-relying on RAII, the JVM will virtually never outperform C++.


It's not a lie at all.

That's why there are profile-guided optimizations (PGO) for C/C++,

which instrument your C/C++ code to collect that needed additional information (at the cost of performance).

Then, when recompiling, you can feed that collected information back into your build.

The problem with profile-guided optimization is that it's way more annoying to deploy: on every update you have to deploy twice, once slower than the naive build and then once faster. And because the slower build might very well be too slow, you might want to deploy it only to some nodes behind a load balancer and deploy naively compiled versions to the others.

It also means you have a similar slower => faster startup transition, except it applies to your program as a whole instead of to each restart of a node.

And while optimized Java will likely never outperform optimized C++ (that is Java-specific), if we are talking about common, non-manually-optimized code which isn't implementing tight math/CS algorithms (i.e. the very common daily code in many companies), then the difference really isn't that big; small enough to make choosing Java over C++ for server stuff the right choice "in general" (if you are not a company which only gets the "best" programmers, like Google).

I mean, just to put it into context, there are companies which have had success running high-speed trading code on the JVM. So if you can do that, Java probably doesn't have a major performance problem.

> not over relying on RAII

you mean like throwing out all the major improvements of C++ which majorly reduced the probability of non-expert programmers introducing bugs that could be turned into RCEs?


Suggesting that you shouldn’t be allocating memory and then immediately tossing it isn’t the same as suggesting that you should practice RCE-prone manual strategies.

RAII is dogshit because it encourages and hides what’s really happening, leading to crazy performance gotchas. Just knowing the gotchas around it is often enough for any halfway competent programmer to devise better code.


RAII is the foundation of safe memory management in C++


If the "more information" is useless, it can be ignored. JIT can't possibly be a worse strategy, because a JIT always has the option of simply doing an AOT compile at program start.

That doesn't mean that in practice the JVM is faster than well-written C++; the Java language semantics almost prevent it from doing so in the general case. But in principle, if all else were equal, it should be able to be.


You've not had to diagnose ZK mysteriously hanging or pegging its CPU due to GC or mysterious memory bloat? I have.


And the JVM and Java ecosystem have much better profiling tools than, e.g., Golang when it has mystery GC pauses.

For a long running, memory safe, server process, there's no other ecosystem (that I know of) that is quite like the JVM.


Consider this: Kubernetes people felt etcd was a better choice, largely because of performance.

https://news.ycombinator.com/item?id=18687516


etcd is pretty good in itself, but Kubernetes has many design choices that do not translate well to other projects, or even to Kubernetes itself.


I mean, Java native compilation is a thing. I imagine you'd get the majority of the lift via that route instead of reinventing the wheel, but it's possible there are caveats to that route for these impls. Even Java native/golang have GC heaps, but in general ZK has really low GC churn, so I don't see that being the impetus to rewrite.

My guess, if anything, is that people always complain about ZK being annoying as a second piece of infra to distribute for simple setups, so my guess is this is just a prelude to them embedding their keeper into the DB deployment itself, which is the same general strategy Kafka is taking (a verrrry loong time) to roll out.


If you need a Zookeeper drop-in replacement then I suppose there are limited options, but many of the same needs could be met by etcd.


TBH, I don't think so.

I mean, I have worked on, and been guilty of, tooling-driven development (RiiR, anyone?).

But, also, in a comment below Alexey shares many of the reasons other than language. I think Oxia does a good job of sharing their approach in - https://github.com/streamnative/oxia/blob/main/docs/design-g...

(Alexey's comment, FYI, https://news.ycombinator.com/item?id=37677324)


It's been a few years since I've checked in with distributed lock services. Why would someone adopt ZooKeeper after etcd gained maturity? I recall seeing benchmarks more than 5 years ago where a naive proxy like zetcd[0] out-performs ZooKeeper itself in many ways and offers more consistent latencies. etcd has gotten lots of battle-testing being Kubernetes' datastore, but I can also see how that has shaped its design in a way that might not fit other projects.

I think there are plenty of other projects (e.g. FoundationDB, Kafka) that also replaced their usage of ZooKeeper as their systems matured. I guess I'm confused why anyone has been picking up new installations of ZooKeeper.

[0]: https://github.com/etcd-io/zetcd


There is no specific reason to start with ZooKeeper, nor with ClickHouse Keeper, if you want to use another distributed consensus system.

But: every such system is slightly different in the data model and the set of available primitives.

It's very hard to build a distributed system correctly, even relying on ZooKeeper/Etcd/FoundationDB. For example, when I hear "distributed lock," I know that there is a 90% chance there is a bug (a distributed lock can be safely used only if every transaction made under the lock also atomically tests that the lock still holds).

So, if there is an existing system heavily relying on one distributed consensus implementation, it's very hard to switch to another. The main value of ClickHouse Keeper is its compatibility with ZooKeeper - it uses the same data model and wire protocol.


FoundationDB doesn’t use locks; because of that it’s relatively easy to build distributed systems on top of it, but the trade-off is the 5-second transaction limit.


The term “distributed lock” is a bit of a mental red flag to me.


As one of the contributors, I'm always happy to see interest and people using it.

Keeper is a really interesting challenge and we're really open to any kind of feedback and thoughts.

If you tried it out and have some feedback for it, I encourage you to create an issue (https://github.com/ClickHouse/ClickHouse), ask on our Slack, ping me directly on Slack... (just don't call me on my phone)

And don't forget that it's completely open-source like ClickHouse so contributors are more than welcome.


shared mergetree is not open!


So someone who gives away free software must give away all software they write forever?


Can you elaborate? The software is distributed under an Apache 2.0 license.


he means the new SharedMergeTree, but that’s clickhouse specific.


I usually scoff at the "written in..." part of such announcements, because it is a sign that the author is focused on the input ("I wrote this in X"), not the output (the value the user gets).

In this case though, the blog outlines specific reasons why this had to be in C++ (interoperability with their C++ codebase) as well as benefits that are separate from the language.


> author is focused on the input ("I wrote this in X") not the output (value the user gets)

Possibly, a lot of us enjoy developing X in Y for the sake of doing so. Not everyone may end up caring about the value that the user gets.


It's a huge turn-off for me as well, because I interpret it as the main value it's supposed to deliver (which for me is 0 for the most part). Not talking about this specific project, just generally.


In this case written in C++ is goodness. Also, having it be a variant of ClickHouse server that can run embedded or standalone is quite nice.


I'm always impressed by the quality of the blog posts coming out from clickhouse.com! Super well written!


Thank you! We try ;)


Basho blog quality 4ever.


It's a deep-cut reference. But it's one I am honoured you made.

:mug:


Looks nice, I will definitely be trying this out.

Built-in S3 storage immediately sold me. I’ve used something called Exhibitor to manage ZK clusters in the past, but it’s totally dead. Working with ZK is probably one of my least favorite things to do.


I used Exhibitor in the past too. It was especially useful during times when the zookeeper cluster needed to expand/shrink or move host nodes. Zookeeper dynamic configuration solved that problem, which seems to also be supported by clickhouse keeper. Pretty impressive! Would definitely give it a try.


It does indeed.

Do note the docs page...

https://clickhouse.com/docs/en/guides/sre/keeper/clickhouse-...

In particular, it is necessary to enable the `keeper_server.enable_reconfiguration` flag. The coverage is pretty exhaustive, but if there is an important use case missing, let us know!
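For reference, that flag lives under the `keeper_server` section of the Keeper configuration; a minimal hedged sketch (the surrounding layout may differ between versions, so check the docs page above for your release):

```xml
<clickhouse>
    <keeper_server>
        <!-- required for dynamic "reconfig" support -->
        <enable_reconfiguration>true</enable_reconfiguration>
    </keeper_server>
</clickhouse>
```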


Thanks for sharing!

If anyone has any questions, I'll do my best to get them answered.

(Disclaimer: I work at ClickHouse)


Thanks for this excellent article! Enjoyed it from start to finish. This gave me a good memory of the work we've done at docker embedding our own replicated and consistent metadata storage using etcd's raft library.

Looking at the initial pull request, is it correct that ClickHouse Keeper is based on eBay's NuRaft library? Or did the ClickHouse team fork and modify this library to accommodate ClickHouse's usage and performance needs?


Yes, you are right: ClickHouse Keeper is based on NuRaft. We made a lot of modifications to this library, both for correctness and performance. Almost all of them (I need to check) have been contributed back to the upstream ebay/NuRaft library.


1. Can this be used without ClickHouse, as just a ZooKeeper replacement?

2. Am I correct that it's using S3 as disk? So can it be run as stateless pods in k8s?

3. If it uses S3, how are the latency and cost of PUTs affected? Does every write result in a PUT call to S3?


1. Yes, it can be used with other applications as a ZooKeeper replacement, unless some unusual ZooKeeper features are used (there is no Kerberos integration in Keeper, and it does not support TTLs on persistent nodes) or the application tests for a specific ZooKeeper version.

2. It can be configured to store snapshots, and Raft logs other than the latest log, in S3. It cannot run as a stateless Kubernetes pod - the latest log has to be located on the filesystem.

Although I see you can make a multi-region setup with multiple independent Kubernetes clusters and store logs in tmpfs (which is not 100% wrong from a theoretical standpoint), it is too risky to be practical.

3. Only the snapshots and the previous logs could be on S3, so the PUT requests are done only on log rotation.


2. OK, so can I rebuild a cluster with just the state in S3? E.g. I create a cluster with local disks and S3 backing, and the entire cluster gets deleted. If I recreate the cluster and point it at the same S3 bucket, will it restore its state?


It depends on how the entire cluster gets deleted.

If one out of three nodes disappears, but two out of three nodes are shut down properly and have written the latest snapshot to S3, it will restore correctly.

If two out of three nodes disappeared, but one out of three nodes was shut down properly and has written the latest snapshot to S3, and you restore from its snapshot - it is equivalent to split-brain, and you could lose some of the transactions that were acknowledged on the other two nodes.

If all three nodes suddenly disappear, and you restore from some previous snapshot on S3, you will lose the transactions acknowledged after the time of this snapshot - this is equivalent to restoring from a backup.

TLDR - Keeper writes the latest log on the filesystem. It does not continuously write data to S3 (it could be tempting, but if we did, it would give a latency of around 100..500 ms, even in the same region, which is comparable to the latency between the most distant AWS regions); it still requires a quorum, and the support of S3 gives no magic.

The primary motivation for such a feature was to reduce the space needed on SSD/EBS disks.


Sometime back, I tried using clickhouse-keeper as a zookeeper alternative with a few other systems like kafka, mesos, and solr. Wrote some notes here: https://pradeepchhetri.xyz/clickhousekeeper/


1. Absolutely. clickhouse-keeper is distributed as a standalone static binary or as a .deb or .rpm package. You can use it without ClickHouse as a ZooKeeper replacement.

2. It's not recommended to use slow storage devices for logs in any coordination system (ZooKeeper, clickhouse-keeper, etcd, and so on). A good setup is a small, fast SSD/EBS disk for fresh logs, with old logs + snapshots offloaded to S3. In such a setup the number of PUT requests will be tiny and latency will be as good as possible.


Is there a python client library you can recommend?


All ZooKeeper libraries are compatible with clickhouse-keeper. The most popular and mature is https://kazoo.readthedocs.io/en/latest/. We use it a lot in our integration-tests framework (with clickhouse-keeper).


The same library that you use for ZooKeeper - kazoo.

Note: our stress tests have found a segmentation fault in Python's kazoo library.

We only wanted to test Keeper, but found every bug around it :) Let me find a link.



Did not expect to see an issue I created


What do you use for network stuff in C++, ASIO?


Yes, boost.asio is used for the internal Raft implementation.


Man, I would love to work at one of these companies that let their engineers go off and implement alternatives to widely used and tested open-source projects like Zookeeper, merely for speculative performance gains and “… C++”

edit: I'm not even saying this facetiously, it would be freaking awesome.


I've been using Clickhouse Keeper in a production environment since its first release and have been really happy. It makes setting up distributed tables in Clickhouse (replicas/shards) quite easy. The only issues I've seen are when inserting at a super high throughput (>1000/sec writes), which is actually more of an issue with Clickhouse MergeTree table settings than with Keeper itself.

I've also written about it here: https://mrkaran.dev/posts/clickhouse-replication/


I hate running services in Java but this will have to earn a lot of trust before it’s a viable replacement in prod


++ agreed.

ClickHouse Keeper was released as feature complete in December of 2021.

It runs thousands of clusters, daily, both in CSP hosted offerings (including our own ClickHouse Cloud) and at customers running the OSS release.

Never accept any claims at face value and always test. But, in this case, it is quite battle-hardened (i.e. the Jepsen tests run 3x daily https://github.com/ClickHouse/ClickHouse/tree/master/tests/j...).


That’s a strong endorsement. I wonder if there’s been any effect where it’s strongly tailored to the API surface area utilized by ClickHouse, and whether there are any gaps elsewhere.


We hope not, and try to keep it wire-compatible for clients to interact with (recently added dynamic reconfig, etc.)

It is definitely opinionated and influenced by our work...but not designed solely for it.

But, also, we continue to improve. Most notably in the work on Multi-group Raft - https://github.com/ClickHouse/ClickHouse/issues/54172


Especially as it doesn't have memory safety, which is table stakes in 2023.


Any thoughts here on Fly's Corrosion? https://github.com/superfly/corrosion


At least two comments spring to mind: this is at least _blogged_ as a drop-in ZK replacement, which for sure is not true of Corrosion, and ClickHouse has Jepsen tests for their distributed KV store, which I don't see any reference to such a thing for Corrosion

Maybe neither of those two things matter for one's use case, but it's similar to someone rolling up on this blog post and saying "but what about etcd" -- they're just different, with wholly different operational and consumer concerns


Does ClickHouse still have some relationship to Yandex?


It seems like these days, Yandex is just one of multiple stock holders; see:

https://en.wikipedia.org/wiki/ClickHouse


"ClickHouse, Inc. is a Delaware company with headquarters in the San Francisco Bay Area. We have no operations in Russia, no Russian investors, and no Russian members of our Board of Directors."

Source - https://clickhouse.com/blog/we-stand-with-ukraine


"A Delaware company" ... funny :-)

Also scary how disciplined US society is, with everyone informally-required to actively cheer the proxy war (while no company "stands with Yemen" for example).


Do you provide a decent C++ client library? ZooKeeper only provides a C library that has certain... disadvantages.


Is it just me or does it look like an "alternative for use-cases ZooKeeper was not intended for"?

E.g. if we quote ZooKeeper:

> ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

and ClickHouse

> ClickHouse is the fastest and most resource-efficient open-source database for real-time applications and analytics.

Like this are completely different use cases with just a small overlap.


And then further down:

> ClickHouse Keeper is a drop-in replacement for ZooKeeper

That opening was about ClickHouse in general, but the article is about one particular application using the database.


Yep.

Generally, ClickHouse Keeper provides the coordination system for data replication and distributed DDL query execution for ClickHouse clusters.


Drop-in replacement doesn't change anything about my question; interfaces are not all that matters.

And now, knowing more about it, I would say the answer to my question is a clear yes: it used ZK for something ZK wasn't at all intended for,

which means it makes a lot of sense that they replace it


Are you perhaps confusing ClickHouse, with ClickHouse Keeper (one of ClickHouse's components?) Sounds to me like ClickHouse is the database, and ClickHouse Keeper is the ZK drop-in replacement. A bit like HBase being a database, and ZooKeeper being a service it is heavily dependent on.


yes and no

yes I confused the quotes

but the question in general, about whether ClickHouse Keeper and ZooKeeper are designed for completely different use-cases which happen to be similar enough to work with the same interface, still stands

and by now, knowing more than in my original comment, I would answer the question with yes


Compatible possibly but unproven in production at scale like ZK.


Definitely used in production ;) and at considerable scale.

It runs thousands of clusters, daily, both in CSP hosted offerings (including our own ClickHouse Cloud) and at customers running the OSS release.

Never accept any claims at face value and always test. But, in this case, it is quite battle-hardened (i.e. the Jepsen tests run 3x daily https://github.com/ClickHouse/ClickHouse/tree/master/tests/j...)

But yes, ZooKeeper is pretty amazing. We are building on the backs of giants.

I'd also argue that Raft vs. ZAB is an important production-scale conversation. But, as the blog says, ZooKeeper is the better option when you require scalability with a read-heavy workload.


Yet another thing I have just used FoundationDB for in the past.


I’d love to read about that.


So could I just point my Kafka at this thing and use it?


fundamentally, yep.

https://pradeepchhetri.xyz/clickhousekeeper/ talks about some experiments in exactly that vein.


You can even migrate your zookeeper to ClickHouse keeper. It requires a small downtime, but you will have all your zookeeper data inside, and your clients will just work when your keeper is back.


Nice to see an alternative to Zookeeper that doesn't depend on the Java runtime.

I thought stuff was supposed to be rewritten in Rust /s


lol.

I was waiting for that somewhere ;)


Written in C++ is not a positive in my book. New things created this decade in unsafe languages (where safe options would have worked fine) should be frowned upon and criticized as bad engineering.


> where safe options would have worked fine

That's a very short-sighted view IMO; software engineering is not just about technology choices.


Software engineering is first and foremost about making good engineering choices. We are proving time and again as an industry that humans CANNOT write mistake-free code.

From just this week: https://news.ycombinator.com/item?id=37600852

Just because someone built a fantastically functional building doesn't mean we can't criticize their choice of foundation. Case in point: Millennium Tower in SF: https://www.nbcbayarea.com/investigations/series/millennium-...


And when making those engineering choices there are different tradeoffs and constraints to be considered; the language to use is one of them. So when a Rust (wild guess) fanboy comes in without any background context and makes comments like yours, it is very telling.

You are correct that bug-free software does not exist. But choosing a “memory safe” language does not prevent that. A seasoned C++ developer knows how to use memory sanitizers and other tools to guarantee the correctness of their code, compared to an average Rust developer who just trusts the compiler, which, guess what, may also have bugs.


Clickhouse is a Yandex project, and at Yandex they historically used C++ for almost everything; I guess it's part of their culture (probably the founders were C++ programmers?). Their web services, such as Yandex Taxi's backend (Uber's equivalent), are also written in C++, which is unusual for webdev nowadays.


s/is a/was a/g

But more interesting, to me, is language adoption and familiarity by region.

I have a bookmarked dev.to article from 2020 that discussed programming language popularity by state - https://dev.to/eduecosystem/what-is-the-most-popular-program...

I'm uncertain if anyone has extrapolated that to more geographic regions. It would be interesting.


[flagged]


We've banned this account for breaking the site guidelines and ignoring our request to stop.

https://news.ycombinator.com/newsguidelines.html


And very bad at accounting :P


I've been looking at RedPanda[0] for a new project.

Any ideas on that?

[0] https://redpanda.com/


It's been a struggle getting Clickhouse accepted here in the U.S., despite its technical prowess, even prior to the war in Ukraine.

I know, I read the blog:

> ClickHouse, Inc. is a Delaware company with headquarters in the San Francisco Bay Area. We have no operations in Russia, no Russian investors, and no Russian members of our Board of Directors. We do, however, have an incredibly talented team of Russian software engineers located in Amsterdam, and we could not be more proud to call them colleagues.

The FUD is really hard to overcome. This is coming from someone who advocated for Clickhouse, sent some PRs, and did a minor code audit.


Why is "written in C++" part of the headline?

As engineers we focus too much on the implementation details and not the benefits to the user.

How about:

- ZooKeeper alternative with lower latency

- ZooKeeper alternative with lower memory use

- ZooKeeper alternative with predictable overheads

(I don't know if these are true, just suggestions)


ZooKeeper alternative with:

1. Snapshots and logs take much less space on disk due to better compression.

2. No limit on the default packet and node data size (it is 1 MB in ZooKeeper)

3. No zxid overflow issue (it forces a restart after every 2 bn transactions in ZooKeeper)

4. Faster recovery after network partitions due to the use of a different distributed consensus protocol.

5. It uses less memory for the same volume of data.

6. It is easier to set up, as it does not require specifying the JVM heap size or a custom GC implementation.

7. A larger coverage by Jepsen tests. (This could be hard to believe, but true - ZooKeeper is tested by Jepsen, but Keeper takes the existing tests and adds more).

8. The possibility to store snapshots and previous logs on S3.

C++ isn't a key detail, just a consequence of the fact that the main ClickHouse code base is written in C++.

If you need a distributed consensus system but not necessarily compatible with ZooKeeper, there are plenty of options: Etcd, Consul, FoundationDB...


2. is configurable (with specific caveats).

3. I believe this was solved a very long time ago? Don't epochs roll over automatically now?

6. Is this remotely relevant? You still want limits in cloud deploys, so I'm not sure how this is a consideration, given it takes 2 minutes when you first set it up to apply best-practice settings.


Disclaimer: I worked on this blog with the team at ClickHouse.

I like your suggestions!

Some of the benefits we summarized on this page https://clickhouse.com/clickhouse/keeper include ease of setup and operation, no overflow issues, better compression, faster recovery, (dramatically) less memory used, etc.

There was actually a reason why C++ was important for us at ClickHouse: C++ is our main code base, and managing a Java project as part of it was not natural. But you're right - for standalone use of this alternative, that doesn't matter.


How much did memory safety factor in the decision? I mean only last week the IT world was yet again bitten by a major memory safety bug, in libwebp.


Given that their entire code base is written in C++, and switching to a different language would be a significant retooling for the team, I think it's reasonable to assume that it did not. Language choice is rarely made on the grounds of specific features, and is often made on the grounds of ergonomics and team knowledge.

A more revealing question is, "How are you dealing with memory safety in this implementation?" There are ways to improve memory safety in C++ through tooling and idiomatic style. Are these things being used?


Keeper is tested in the same way as ClickHouse.

There are Keeper only tests, but we run ClickHouse with Keeper for all of our server tests.

For each test, we try to use all the useful tools for verifying safety and correctness, like sanitizers.

E.g. an interesting tool we introduced in our codebase for thread safety https://clang.llvm.org/docs/ThreadSafetyAnalysis.html#

We found some issues using sanitizers in our codebase and in the NuRaft library itself, which were fixed immediately.

And let's not forget about Jepsen, which surfaced some really tricky bugs, though those were more related to correctness.


Thanks. That is useful.

I would suggest looking into CBMC and similar tools as well. Model checking is incredibly useful.


Sadly I never put enough effort into trying out such checks, but your excitement about them gives me motivation to properly try them out.


CBMC is subtle and will require some code changes to use effectively.

The real key for using it, in my opinion, is to isolate individual classes and functions. Avoid instrumenting code with recursion and loops, and focus on defining and verifying function contracts, class invariants, and resource / memory lifetimes.

It will require a significant amount of work to mock up standard library and third party library APIs, but the real beauty of CBMC is that once you define the interface contracts for these APIs and libraries, you can verify every use of them.

I used CBMC previously to verify proper usage rules with C / JNI integration. JNI can be one complicated beast, and CBMC handily managed rule checks for its use.

I'm an extremely careful developer who unit tests everything and strives for 99% coverage. CBMC was still able to detect a memory overwrite flaw in a networking library I wrote, caused by undefined behavior from integer promotion and offset math. It passed the various sanitizers and unit tests I had in place, but CBMC was able to reduce it to an actual crash condition that was potentially exploitable.

I don't think I can over-emphasize the usefulness of this tool.


Disclaimer: I worked on this blog at ClickHouse with the team.

We'll look into them and add them to some of our social promotion over the coming weeks. Will try to find a way to give you credit.


I'd be most interested in a Zookeeper alternative that doesn't have massive bugs in leader election.


The article states that they're using Raft rather than ZAB for consensus, so it should be less prone to leader-election bugs: Raft is easier to reason about, and its leader election process is more straightforward (randomized election timeouts minimize the chance that any two nodes become candidates at the same time, which avoids starting multiple concurrent elections).


Just adding on to abronan's nice response: the protocol we use also includes some leader-election optimizations, e.g. the pre-vote protocol (https://github.com/eBay/NuRaft/blob/master/docs/prevote_prot...)

Also, we inject many different faults in our Jepsen tests, which run three times a day, and we've never had a problem with leader election. I know this doesn't prove there are no bugs, but it's pretty reassuring, I would say.


Running services on the JVM is terrible and requires far more resources than other platforms


Depends on how much one cares about memory corruption, developer tooling and library ecosystem.


Yes I agree memory safety is an important trade off, but the performance wins can be worth it in some cases.


It is to let you know not to use it, since it is a pile of memory-related CVEs just waiting for the joy of discovery.


That's true: C++ libraries are typically bug-ridden and require exhaustive effort to clean up.

But the latest bugs found by the ClickHouse continuous integration system in the related library were fixed about a year ago:

https://github.com/eBay/NuRaft/pull/373 https://github.com/eBay/NuRaft/pull/392


If even Google, Mozilla, Apple, and Microsoft can't get C++ right in their apps, with all the fuzzing tools they invented, I don't have much faith that other companies are going to fare better with memory safety and C++.


Java vs. C++ is a very important implementation detail, especially for "the benefits to the user". Java is a commercial platform that requires a fee to Oracle if used in an enterprise, while a compiled C++ binary does not.


That's a _very_ incorrect statement. You can use any OpenJDK distribution you want (which is GPLv2 with the Classpath Exception) to run Apache ZooKeeper without having any agreement with Oracle or paying any fee. The Oracle JDK is just Oracle's commercial version of their OpenJDK distribution, with Oracle support.

You can use the OpenJDK distro shipped with your Linux distro (Red Hat, Debian, etc.), you can use Microsoft's OpenJDK distro [1], you can use the Eclipse OpenJDK distro [2], you can use Amazon's OpenJDK distro [3], and there are a whole bunch more.

[1] https://www.microsoft.com/openjdk [2] https://adoptium.net/ [3] https://aws.amazon.com/corretto/


The usual FUD, no it doesn't require any fee, use OpenJDK.

Several distributions to choose from.


What year are you from to say such nonsense?


Or you can just use OpenJDK, right?


All ZooKeeper installations I've seen in production so far were on the Oracle JDK, for one reason or another.


Probably because people are stuck in the 2010s. OpenJDK used to have more compatibility issues.



