Khepri is a tree-like replicated on-disk database library for Erlang and Elixir (github.com/rabbitmq)
70 points by dr_linux 47 days ago | hide | past | favorite | 18 comments

My understanding, after spending a few minutes reading through the project web page and Stack Overflow questions about RabbitMQ clustering problems, is as follows.

Khepri is a project intended to replace Mnesia for replicated, clustered distributed systems written in Erlang/Elixir, such as RabbitMQ. The primary reason this project started was to address Mnesia's shortcomings with regard to "network partitions", where cluster nodes run on unreliable networks and network failures happen.

Caveat emptor! The project README still refers to Khepri as an alpha product, so I assume that RabbitMQ continues to use some kind of customized Mnesia for its production system. Unfortunately, in the absence of any significant system using Khepri, should you decide to adopt it you will be on the bleeding edge: at best you will have scars to show for it, and at worst a complete project failure. Distributed systems problems are very hard problems to solve.

As one of the authors of Khepri, I should probably clarify what we mean by "alpha" quality in the README/documentation.

The data is handled by our Raft library underneath, which is production ready, so for a given version of Khepri, the data is safe.

However, the public API and internals of Khepri are unstable. Therefore, as a user of Khepri, you may have to adapt your code when upgrading Khepri to a later version. If the internals change too, you may need to work on some migration tool to export the existing data and import it in a new instance. This is not something Khepri will do for you with an upgrade before it reaches 1.0.0.

And you are right that RabbitMQ releases don't use Khepri at all for now and rely entirely on Mnesia. Future releases will introduce Khepri gradually, for vhosts and users first, and more types of records with each minor release.

This is only tangentially related to Khepri, but I'm curious why the STEM community started treating the word "data" like a plural noun, e.g., "Data ARE stored in a tree structure"?

I know in Latin, "data" is plural and "datum" is singular, but we aren't speaking Latin, and even if we were, we're not doing so consistently. For example, we don't say, "The meeting's agenda ARE up to date," even though "agenda" is plural and "agendum" is singular. Instead, we adapt the word "agenda" to our grammar and use it as a singular collection, "The meeting's agenda IS up to date," similar to the way we say, "The population IS growing."

To me, saying "Data are..." is pretentious, like when people use the word "an" before words that start with a consonant.

EDIT: Fixed a capitalization typo.

> To me, saying "Data are..." is pretentious, like when people use the word "an" before words that start with a consonant.

AFAIK that is only done for words that sound like they start with a vowel, and I think it is done for easier pronunciation. Why is that pretentious?

That's not what I mean. The grammar rule isn't "an" before vowels and "a" before consonants; it is "an" before words that start with vowel sounds and "a" before words that begin with consonant sounds, regardless of what letter the word actually begins with (e.g., "an hour" is correct, even though "hour" starts with an "H"). This is normal.

What I'm talking about is when people reverse this rule to make themselves sound smarter. For example, I have worked with a few people during my career who would say things like, "I will need an pen and an paper to take notes." When I asked why they did that, they'd parrot back the grammar rule for who vs. whom, which is entirely wrong. This is what I mean by pretentious.

When I wrote that documentation (and comments in the code), I looked up whether "data" was singular or plural in English (which is not my mother tongue), because I had no idea what it was. According to the sources I found at that time, there was no clear rule, but it looked to me like it was mostly plural.

I meant no disrespect to you and apologize if it came across that way. It's a weirdness (in my opinion) of English and how we adapt Latin words into our grammar in inconsistent ways.

"Started"? That both forms are used, somewhat depending on field, is not really a new thing.

It's relatively new to me. When I was in college, books like "The Future of Data Analysis" (JW Tukey, 1962) used "data is." I believe all of my CS textbooks treated the word "data" as a singular collection back then. I've only noticed the ubiquity of "data are" in engineering books since the turn of the century.

It could be a false memory though, which is why I'm asking.

My impression from style guides, etc., is that the singular form is the newer one, and in some cases only somewhat begrudgingly accepted. E.g., some recommend that the singular is acceptable when writing for a "general audience" but not in "formal or scientific writing", while others refer to the plural as "still in common use" (but with the singular taking over). But maybe there has indeed been a return to it in this specific field? Not sure.

Thanks for the explanation. +1.

Data is the Latin plural of the Latin word datum... which for some reason is out of use, and we now call that a "data point".

Anyway, "data" as plural is a couple of thousand years old.

Yes. I mentioned this (and a comparison to the way we use agenda vs. agendum) in my original post.

In the '90s, two brothers, friends of mine, built a series of commercial software products using a similar implementation: double-entry accounting, a construction project evaluator, and several others. At a time when everybody was using dBase and FoxPro and struggling, the speed of development and the low hardware requirements of their products were unmatched.

This is interesting in that it uses Raft. I had a go at writing a toy Raft implementation, and I think testing my particular implementation properly would be pretty hard to do without TLA+ or automated proof tools. You could also use Jepsen.

I maintained a RabbitMQ cluster and used Chef to coordinate upgrades. Upgrading Erlang AND RabbitMQ at the same time wasn't very enjoyable since it had the potential to go wrong in production. Fortunately it went right and nobody noticed.

If RabbitMQ can solve the partition problem with Khepri, that would be great. The user guide explains how to configure RabbitMQ to handle splits.


I think we used pause_minority which sacrifices availability for consistency.
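
For anyone unfamiliar with that trade-off, here is a minimal Python sketch of the idea as I understand it: a node that can see only half or fewer of the cluster's members stops serving rather than risk diverging from the other side of the split. The function name and arguments are hypothetical and for illustration only; this is not RabbitMQ's actual code.

```python
def should_pause(visible_nodes: int, cluster_size: int) -> bool:
    """Decide whether a node should pause under a pause_minority-style policy.

    visible_nodes: members this node can currently reach, itself included.
    cluster_size:  total number of members configured in the cluster.

    Hypothetical helper for illustration; not part of RabbitMQ.
    """
    if not 1 <= visible_nodes <= cluster_size:
        raise ValueError("visible_nodes must be between 1 and cluster_size")
    # "Minority" here means half of the cluster or fewer, so an even split
    # pauses both sides: consistency is kept at the cost of availability.
    return 2 * visible_nodes <= cluster_size


# A 3-node cluster split 2/1: only the isolated node pauses.
print(should_pause(1, 3), should_pause(2, 3))
# A 4-node cluster split 2/2: both halves pause.
print(should_pause(2, 4))
```

The even-split case is where the lost availability shows most clearly, which is one reason odd cluster sizes are usually preferred.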

Quorum queues already solve this problem today. They pass a "harsher" fork of Jepsen, according to the RMQ team.

Khepri is a newer project that reuses the Raft library developed for quorum queues to store metadata, as a replacement for Mnesia, Erlang's built-in distributed DB. I think that means things like vhosts, users, permissions, etc.

https://rabbitmq.github.io/khepri/ has a bit more information on why you might want to use this, from what I can understand. It's a bit over my head. I guess it's sort of simpler to manage a bunch of data on disk vs. a regular DB (if you don't consider that to be just a bunch of data on disk too), mostly around network issues?

Khepri is an interesting alternative to Neo4j for a program I wrote to store tree data a few years ago. Of course, I'd have to try it out with actual data, but it seems performant in the few minutes I've been fiddling with it.
