Implementing a distributed key-value store on top of implementing Raft in Go (eatonphil.com)
272 points by simjue on May 25, 2023 | 79 comments



These kinds of posts are my favorite part of HN: the deep dives into LLMs, state machine theory, Fourier transforms, locking data structures, and memory allocation that are miles above the basic posts you get from searching around the internet - but not quite a full book yet.

The author spent 7 months tinkering and cared enough to come back and share that with us.


Phil rocks. Highly suggest following him on twitter. He’s pretty outspoken about writing. This is a great post of his that I need to reflect on more: https://notes.eatonphil.com/is-it-worth-writing-about.html


Glad that post resonated! <3


I wish there was a separate catalog of these posts which I could browse through.


We need Hacker Hacker News, with the more serious tech-related posts too techie for Hacker News.

In all seriousness, I actually feel bad being here sometimes as a layman trying to learn what engineers and all the smarties here think about news. Imposter syndrome from commenting on an internet forum, how sad is that lol


A side topic: how can a company find people like the author, who can really articulate how to build a system? I've been interviewing senior engineers for years, and more than 95% of them, if not more, are really box drawers. They throw terms around confidently but fail miserably the moment I ask for specifics. No, I don't ask for details like how Raft works. That level of detail would probably fail all the candidates. Just simple things like how you organize the data, how you route the traffic, how you manage metadata, or how your data path keeps metadata in sync. Really basic ones, yet most candidates fail. Indeed, the more senior they are, the more likely they are to lose touch with the details -- even those who claim to work on distributed databases.


I recently interviewed with a company, and the interviewer asked me "how do you organize data." I wasn't sure if they wanted to talk about classes, modules, databases, k-v stores, hashing data and routing distributed requests to the same pods. I asked, and they answered "I mean in general, how do you organize data."

After talking for a bit about pretty much all of the above, the interviewer asked "have you used dictionaries?"

The reason I'm telling the story is, if a lot of your candidates fail to answer your questions, the problem might be in the question.


I would have halted for a moment and asked something like: organize data for what purpose? At which point I would expect some amount of clarification, as it sounds to me like they're talking about organizing some set of data for lookup, as opposed to organizing data around how it flows through an application, for example.


I often find that when you are given questions like that, no matter how you probe, you get a really defensive interviewer who doesn't want to give away the answer they are expecting. I've been in very similar situations with vague questions: I've tried to probe for more details and just been met with defensiveness, with an attitude that says they expect me to know exactly what they mean.


It is worse: your attempt to clarify the question may be regarded by some interviewers as a sign you are not a senior dev (they think [erroneously, in my opinion] that "senior" means you must figure out the fine details yourself).


This seems odd to me, though it's entirely possible my experience thus far just hasn't been representative. I've been told at the last 3 places I worked that one of the reasons I was hired was how inquisitive I was during the interview stage, asking lots of questions before coming up with solutions.

I am most definitely not ruling out that what you're saying may be the common case and I am just very lucky, but I do want to act as a sort of counterpoint to see if others can weigh in, so we can get more of an industry sense around this.


Anecdatum: my experience has been the same as others' - asking questions about the interview questions generally leads to a bad outcome. In 25 years, I reckon it's been less than a handful of times where asking questions of the interviewers has actually got a positive response.

Hell, even when in jobs, asking questions about projects I'm assigned has sometimes got a negative response...


Seems weird to me that asking a specific question about what it is you're trying to achieve would be negative, especially since most interviews open with "feel free to ask clarifying questions" or something along those lines.

If the prompt is unclear it's worth getting clarity, just like if something is unclear in the job you seek clarity. I feel like people not asking questions would be a big red flag.

Of course, asking too many (this is subjective, but I think we can all think of a reasonable situation where there were too many questions being asked relative to their value) could be a red flag.


> just like if something is unclear in the job you seek clarity

There are a distressing number of managers / leads I have encountered who consider that if you're asking for clarity, you're impugning their powers of explanation because CLEARLY they explained it well enough (after all, they understand it!) and you're either an idiot or being sarcastic to undermine them.


Could it be because the clarification questions are perceived as not understanding? Would it help if one prefaced with something like "Well, there's many different ways like X, Y and Z and the best one depends on the details of the requirements. Did you have something specific in mind and if so what are those requirements?"

By mentioning a few ways you'd show you're aware of various solutions, and you'd also provide context for why you're asking probing questions.

Then again, I haven't interviewed in quite a while (love my current job) so...


> Could it be because the clarification questions are perceived as not understanding?

Sure but taking that as "the asker is an idiot" rather than "I may not have explained this well" is all too common (I know I've been guilty of this more than a handful of times.)

> Would it help if one prefaced with something like [...]

I think if you're having to carefully phrase your (reasonable, obvs.) questions in order to avoid upsetting the interviewer, that's a bit of a red flag, no?


When I interviewed for software development, I used to have a very open-ended question about optimizing a system I would describe, and there was no right answer, there were tons of directions to go in, and the point was to get the interviewee to ask questions so I could figure out if they'd been around the block and which blocks they had been around.

Why would you only want cookie-cutter "correct" answers to your questions? That doesn't tell you shit about the candidate?


On a rare occasion I actually got feedback after the interview (it was for a startup). Because I asked 'too many questions', it was decided that I'd need way more mentoring than they could afford for the senior position I was applying for.


That's infuriating. There's a clear distinction between asking "how do I do this thing?" and "what do you want me to do?"

I can't read your mind, and that goes both for interview questions and determining product needs.


> After talking for a bit about pretty much all of the above, the interviewer asked "have you used dictionaries?"

Hahaha classic.

I was once in an interview (which I failed) and was asked a problem related to some sort of monotonic queue. I wrote the solution and was going through it, when the interviewer asked: 'what CS concept are you using in this solution'? I didn't really understand what he was asking for and in turn I asked for more clarification like 'Are you referring to the data structures I used? Or the technique (it was some sort of greedy)?' 'no'. After a couple of back-and-forth questions and answers, he finally tells me he was looking for me to say 'a state machine'. Seriously?


"I was once in an interview (which I failed)"

You didn't fail that interview. They did.


I have had that exact experience before. It is fine; you realise the company is toast and you move on (the place that asked me this had just got rinsed by an offshore consulting firm - ecomm site built with a graph database, comedic stuff).


There are plenty of different types of dictionaries, organizing their data differently.


Do you understand the difference between writing a blog post about something and being able to talk about it in an interview? The majority of work in software is not being an "expert" on any one thing, because the one thing that employers require changes quite frequently. Rather, it is being able to build up experience and carry knowledge forward to new areas.

Most technical interviews are conducted in a completely inane way where the goal is to memorize pieces of information. If that was how software worked, the best engineer would just be Google. 95% of people who know enough to build this system would likely be unable to answer your questions because, more than likely, they do not currently work in a job that requires them to use this information every day. This does not mean they do not understand it or are unable to build it.

The reason companies can't find people like this is a combination of not understanding what software development is and not understanding people. It is really the blind leading the blind, but technical people are often just as bad at interviewing.

You are asking completely wrong questions. You aren't starting from the right place.


> Rather it is being able to build up experience and carrying knowledge forward to new areas.

Then I'd expect the candidate to be able to articulate her experience and knowledge in some way, no? Otherwise, how would I know the candidate has the built-up expertise? Of course, I assume we can only have interviews. Otherwise, we can use other means, like a mini-project, an onsite project, or a writeup. Some candidates do like the alternatives more, and some don't.

> where the goal is to memorize pieces of information

I disagree. The goal is to see if a candidate does know what the candidate herself claims to know. I find it hard to imagine that a candidate claims to be an expert in a field yet couldn't articulate even one thing in depth for hours if not days, let alone 30 minutes of interview time. Note this is not about any specific details, but the general picture and insights that an expert can convey. This is like a PhD oral defense: the candidate talks about the topic that the candidate is familiar with, and the professors dive in on those topics. I don't see what's wrong with that.


Candidates aren't defending a PhD. The interviewer is nowhere near competent enough for this.

> The goal is to see if a candidate does have what the candidate herself claims to know.

The process you are using does not tell you that. You will notice that your explanation is full of things that you expect, not based in an understanding of how reality actually works. And the result is, unsurprisingly, one where you do not understand the output.


Maybe you're the problem? Looking at your questions, you're asking for opinions, not answers or solutions. Putting people on the spot is bad form in my opinion. If I have a technical challenge I want you to solve, I'll give you the time for that so I can actually see your work and discuss it with you.

I've seen people that can talk the talk but not walk the walk and vice versa, they can't articulate well, but their approach and work speaks for itself.


Maybe I am. For a senior enough role, my questions are open ended. I'd expect the candidate to drive conversations and examine alternatives. I ask follow-up questions based on what the candidate mentions. Yeah, such may be indeed their opinions, but at least they should be opinions derived from assumptions, constraints, and fundamentals.


That’s a terrible way to interview. Interviews are high-stress, time-constrained settings where you’re talking to people you don’t know. If you want thoughtful answers, you need to be precise in what you’re asking. You can’t be like “tell me about yourself” and then complain that you didn’t get from the candidate whether steak is their favorite food.


I don't think that it is a terrible way to interview at all, for a senior position.

A generic question like "how do you organise data in this situation" is open-ended enough that it merits followups that the interviewee has to drive. If I was being interviewed, I'd ask "what are the operations you want on the data (random lookup vs scan, append, LIFO access, etc.), what performance guarantees do you want on those operations, what persistence and distributed fault-tolerance guarantees, if any?", and so on.

It is no different from having to be in front of a client and teasing out what the client really wants. The interviewee has to guide the questions and answer them in a satisfactory way. Reminds me of Prof. Manuel Blum (then at Berkeley), whose class project would be "think of a problem and solve it, and you'll be graded on both the quality of the problem and the solution".


I am not disagreeing with what the question is supposed to do and how candidates are supposed to answer.

The problem lies in the fact that it creates too much ambiguity in terms of how the candidate is evaluated. Some interviewers may give you credit for your approach, but like the parent, more often than not, you're evaluated on getting to a specific answer or direction, in a specific amount of time. And that just creates a bias towards hiring people who have a chain of thought exactly like yours, which may, tbh, not be the correct chain of thought at all.


In a f*king database like everyone else, are you happy now? What have you learned?


You seem confident you wouldn't have failed the author of this blog post, but I wouldn't be so certain.

Interviews are a high pressure situation where people want an immediate answer, without much of any time to think, let alone research options or run an experiment. You get people's absolute worst possible results.


Where do you find the jobs where people actually care about engineering? Because real work is mostly careful plumbing without breaking what works, but in interviews you must leetcode. Real systems only work when kept reasonably simple, but system design interviews are all about drawing boxes.


I have worked in distributed databases. I can't seem to find any host organization that's actually building anything and really cares about those details. I know several other senior systems engineers in the same boat. Where are you guys hiding?


I think there are plenty of companies building stuff like this, and hiring for it. The problem is going to be that the founding engineers have already hammered down (or attempted to) the bulk of the initial interesting engineering problems and what they want is foot soldiers to complete, fix, patch up, and maintain their design.

That's ok, because that's what the software engineering profession is, on the whole, for better or for worse.

But... the interview process should be clear about that, and the questions reflect that. Rather than "how would you build a high performance distributed database" the questions in those cases need to revolve around identifying problems in a design and how benchmarking, diagnostics, bug fixing, etc. would go. Even at the height of VC fever a couple years ago, being blessed with "Hey, here's some $$, go write a [database|operating system|game engine|programming language|other sexy thing]" doesn't happen really (and probably for good reasons.)

So, yeah, GP commenter is a bit disingenuous. There really are only a scant number of jobs which would involve actually being able to do these things, let alone answer how to do them in an interview on demand. The jobs that are there are fixing someone else's thing where they already did this stuff.

Said interviewer would likely find a candidate who can answer all those questions... and then hire them to go write some microservices in C# for an insurance company, or join an on-call shift diagnosing production problems with an off the shelf database replication process or something.

Of course, not telling you what you don't already know, just ranting :-)


You find them via their blogs.


Try to interview more senior people, who still remember their C course at university, or even know what assembler is. Currently, most graduates (in my experience) don't know how an integer is stored in memory.


I don't even know what a good answer to that is. Some integers are stored in registers. Some are stored on the stack. Some are stored on the heap. Or you could be asking how integers are represented, with two's complement being the usual way. Did I pass your ambiguous question? Probably not.


Here is a nice interactive and visual representation of how Raft works: http://thesecretlivesofdata.com/raft/


This was really great, thank you so much for sharing.


That's really cool!


I was lucky enough to take David Beazley's "Rafting Trip," a five-day training course that guides each student through building their own Raft implementation from scratch:

https://www.dabeaz.com/raft.html

I'd recommend the course to anyone with development experience working with or near distributed systems. David is a fantastic instructor and facilitator, and the blend of student backgrounds led to some great learning and discussion.

(I have no financial or personal interest here; I just loved the course.)


Out of curiosity, did you register as an individual? And if so, do you recall how much the course cost you? His courses seem very interesting.


Yes, I registered as an individual (but was reimbursed through my employer's training budget), and did so a few months in advance as the previous session seemed to fill up pretty quickly. It was USD$1500 for the week. I'm keeping an eye on his course listing too - I'd love to take more.


Thank you for the write-up, eatonphil.

I experimentally implemented Raft in Java but I am not very confident that I did it correctly.

I wish there was a way to implement stateful programs that guarantee "forward progress" and are "steady state systems" - essentially a state machine that cannot be trapped in a state. Debugging the absence of forward-moving progress, or a lack of causation, is very difficult.

When there are essentially different actors in the system that can interact with each other by communicating, they each have a number of states they can get into. There's no guarantee that the system shall converge on a state from which forward progress can be made. Maybe TLA+ is the right answer here.

YMMV, but I think (my) reasoning over stateful systems is rather difficult. I think there are lots of hidden states that we cannot easily detect or reason about because they're in our blind spots, especially related to synchronization. I think it's part of what makes multithreading and distributed systems so hard: every component can be in a different state, and if something is not where it is expected to be, the baton doesn't get passed to the correct state. If you check for something too early, you have a race condition.

If we could see in slow motion what was going on, an interaction between different actors, we could work out why something happens the way it does. But usually the logs are too numerous to get to this detail. I think animation can save us, but what does a Raft animation look like?

How often have you seen an endless spinner? It's as if a completion event was raised but didn't get detected and the system is waiting for something that shall never occur. I want this kind of error to be impossible. This is one form of hidden state that prevents progress.

I wrote an eventually consistent mesh protocol in Python and tested it with Jepsen, it is not linearizable because the consistency level is "eventually consistent".

I don't understand how Raft can scale writes or reads across multiple machines due to the round trip time talking to other nodes.


Microsoft has a library/tool called Coyote* that helps with testing distributed systems; you write tests/specifications, and Coyote will systematically explore nondeterminism in your system and check if your tests still pass. If there's a failure, it'll show the sequence of events that led to the failing test.

I started a project to implement Raft with a KV store on top, similar to the article, meaning to use Coyote to test it; I didn't get that far before losing interest, though. It's reassuring to read that it took Phil several months to write the code in the post; it's good to know that this is a decidedly nontrivial problem.

* https://github.com/microsoft/coyote


Sounds like a job for state machines, which you can build with a library like xstate[0] (though I'm sure there are similar libraries in whatever language you choose; Python has one called Automat[1]).

These exist to formalize state logic (current state, computing state, transitioning state, etc.), and you can even produce diagrams based on their definitions. Advanced libraries like xstate even have Actors as part of the core of the library.

[0]: https://stately.ai/docs/xstate

[1]: https://github.com/glyph/Automat
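
To make the idea concrete, here's a toy, explicitly table-driven state machine in Go (states, events, and names are all made up for illustration; xstate and Automat give you this same shape plus diagrams, guards, and actors):

    package main

    import "fmt"

    type State string
    type Event string

    const (
        Idle    State = "idle"
        Loading State = "loading"
        Done    State = "done"
        Failed  State = "failed"
    )

    // Every legal transition is listed explicitly; anything not in the
    // table is rejected, so the machine can't wander into a hidden state.
    var transitions = map[State]map[Event]State{
        Idle:    {"fetch": Loading},
        Loading: {"resolve": Done, "reject": Failed},
        Failed:  {"retry": Loading},
    }

    func step(s State, e Event) (State, error) {
        if next, ok := transitions[s][e]; ok {
            return next, nil
        }
        return s, fmt.Errorf("no transition from %q on %q", s, e)
    }

    func main() {
        s := Idle
        for _, e := range []Event{"fetch", "reject", "retry", "resolve"} {
            next, err := step(s, e)
            if err != nil {
                fmt.Println(err)
                continue
            }
            s = next
            fmt.Println("now in", s)
        }
    }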


Don’t have anything to comment on the rest, but my understanding with respect to:

> I don't understand how Raft can scale writes or reads across multiple machines due to the round trip time talking to other nodes.

Is that at the point where that becomes a bottleneck you scale to multiple clusters, where the read/write destination cluster is determined by a key and whatever your preferred hashing mechanism is.
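
A minimal sketch of that kind of key-to-cluster routing in Go (the modulo hash is just the simplest possible illustration; in practice you'd want consistent hashing or a directory service so resharding doesn't move every key):

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // clusterFor maps a key to one of n independent Raft clusters. Each
    // cluster runs its own consensus, so cross-key ordering is given up in
    // exchange for spreading read/write load across multiple leaders.
    func clusterFor(key string, n uint32) uint32 {
        h := fnv.New32a()
        h.Write([]byte(key))
        return h.Sum32() % n
    }

    func main() {
        for _, k := range []string{"user:1", "user:2", "order:99"} {
            fmt.Printf("%s -> cluster %d\n", k, clusterFor(k, 3))
        }
    }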


I think you should try TLA+. I found it surprisingly easy: https://beza1e1.tuxen.de/tla-plus.html

Still haven’t found an opportunity to use it professionally though.


It's not easy. If it was easy, everyone would be using it. I think it's more like thought-provoking.


Cool post. Wonder if it would have helped to take a look at the MIT distributed systems course on the web and YouTube - one of the projects is exactly this (Go Raft implementation)

https://pdos.csail.mit.edu/6.824/labs/lab-shard.html


I just don't love learning from videos personally. But I'm sure those are good if you do!


Nope totally agree, can’t just do videos myself either unless it’s a high level systems topic with visuals. Need to get into the code too


Your note about encoding/gob being inefficient is somewhat accurate for how you're using it, but I want to talk a bit about how you could improve your use.

encoding/gob is intended for streams, not stateless marshals/unmarshals. The first thing that is sent over the stream is the type information the receiver should expect, that's why your payload was so large. After the first type is received, subsequent messages are much smaller. You can see this by extending your example to do multiple writes; each write after the first is only 10 bytes: https://play.golang.com/p/Po_iaXrTUER

You have to plan differently, but you could get large improvements to transmission sizes by changing to append-only files and creating the gob encoder once per file. If you find you're creating a gob encoder/decoder very often, that's a telltale sign you're not using it as intended.
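
A minimal sketch of the stream-oriented usage (the Entry type here is made up, not from the post): only the first Encode on a given encoder pays for the type descriptor, which is why one long-lived encoder over an append-only file is so much cheaper than re-encoding from scratch each time.

    package main

    import (
        "bytes"
        "encoding/gob"
        "fmt"
        "log"
    )

    type Entry struct {
        Term    uint64
        Command []byte
    }

    func main() {
        var buf bytes.Buffer

        // One encoder per stream/file: the type descriptor for Entry is
        // written once, on the first Encode call.
        enc := gob.NewEncoder(&buf)

        prev := 0
        for i := 0; i < 3; i++ {
            e := Entry{Term: uint64(i), Command: []byte("set x 1")}
            if err := enc.Encode(e); err != nil {
                log.Fatal(err)
            }
            // The first write is large (type info); later ones are tiny.
            fmt.Printf("write %d added %d bytes\n", i, buf.Len()-prev)
            prev = buf.Len()
        }

        // Decoding mirrors this: one decoder reads the whole stream back.
        dec := gob.NewDecoder(&buf)
        for {
            var e Entry
            if err := dec.Decode(&e); err != nil {
                break // io.EOF once the stream is exhausted
            }
        }
    }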


Another thing I glossed over (unintentionally) is that I only started limiting batch size after I switched to the custom encoder. So persist() being a function of length wouldn't be quite true anymore.

However, I still keep seeing encoding/gob high up in the profiler taking a lot of time doing reflection during RPC calls.

So it does still seem like it's not ideal. Though I may still just not be understanding how to use net/rpc correctly either.


> encoding/gob is intended for streams, not stateless marshals/unmarshals

I don't understand this within the context of the rest of your comment. I use gob for marshaling stuff to storage all the time, I'm not aware of a better way to do that (serialize data to binary).


Sorry, I should have been more precise. encoding/gob is not optimized for situations where you create an encoder or decoder, read/write a single value, then discard that encoder/decoder. As the author noted, payloads for a single call to Encode() are quite large. Additionally, re-instantiating a gob encoder for each call to Encode() is very expensive allocation-wise and benchmarks where this happens will show gob to perform poorly in these scenarios. You can certainly still use gob this way, and if the performance works for you then have at it! But it performs significantly better in situations where you make multiple calls to Encode() with the same encoder.


The parent is saying that gob includes the type information in the serialized payload. This is extraneous information in many cases, and the result is an unnecessarily large payload.


This is essentially do-it-yourself etcd.

https://github.com/etcd-io/etcd


And TiDB


TiKV [0] is closer (written in Rust).

0. https://github.com/tikv/tikv


Recently, I have been watching the MIT 6.824 course. This article appeared at a very timely moment :)

Here is another Raft implementation in Go: https://github.com/eliben/raft


I think that in the specific case of a key-value store (with just a get/put API), there is a more scalable way to get consensus than putting a KV store on Raft. Raft (and Multi-Paxos) present a shared-log abstraction, which is good if you want to impose a single ordering (what a replicated state machine needs). But writes to different keys are order-independent, and you don't need ordering (between keys) to get a linearizable KV store. By going through Raft, multi-decree Paxos, or viewstamped replication, you pay a penalty for ensuring that all writes are sequentialised via the log. The penalty includes having to go through a single leader for all reads and writes, and lots of jitter when a leader fails, up until the new leader becomes stable.

In contrast, if you simply use the equivalent of basic (single-decree) Paxos per key, the writes can be leaderless. Of course, for performance reasons, each server must batch network and disk I/O.
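
Roughly, the structural difference looks like this (hypothetical Go interfaces, with the actual Paxos machinery elided):

    package consensus

    // With a shared log (Raft, Multi-Paxos, viewstamped replication), every
    // operation is funneled through one totally ordered sequence, typically
    // via a single leader.
    type ReplicatedLog interface {
        // Append blocks until the command is committed at some log index.
        Append(cmd []byte) (index uint64, err error)
    }

    // With per-key consensus, each key is its own tiny consensus instance
    // (single-decree Paxos or similar). Writes to different keys never have
    // to agree on a relative order, so there is no single-leader bottleneck,
    // while each individual key stays linearizable.
    type PerKeyStore interface {
        // Put runs a round of consensus for this key only.
        Put(key string, value []byte) error
        // Get also goes through consensus, so it observes the latest
        // committed value for that key.
        Get(key string) ([]byte, error)
    }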


> if you simply use the equivalent of basic (single-decree) paxos per key, the writes can be leaderless

Are there any write acknowledgements in what you describe when the replication factor is > 1?

How do you ensure that I always read the last value written to the same key?


Both reads and writes go through the basic paxos protocol, so the answer to both is yes.


I've never taken the time to really learn paxos in depth, so forgive my ignorance.

I'm not sure that leaderless mode with basic paxos would be really better in the K/V scenario we're talking about.

There would be a lot of round trips for each request, so there's higher latency than the Raft counterpart, especially for reads. I suppose better availability in the case of node outages would be one advantage.

Anything else I miss?


Nope; latency would be about the same, since it requires two rounds of consensus for all ops. Note that Raft itself has no clue about the specifics of the op, whether it is a read or a write, and for the leader to be sure it is still the leader, it needs to wait for an answer from a majority of servers. Of course, this is not done for every single op, but batched. Similar methods would have to be used for my suggestion as well.

Now, in many production cases, one tends to put a lease on leader election. While the lease is valid, the leader is guaranteed that it is still the leader and no one else has a more up-to-date state. This allows reads to be served from the leader's state machine without having to touch Raft at all.
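
A rough sketch of what that read path can look like in Go (field and method names are invented, not from any particular implementation; a real system also has to account for clock skew when granting leases):

    package kv

    import (
        "errors"
        "sync"
        "time"
    )

    type Server struct {
        mu          sync.Mutex
        isLeader    bool
        leaseExpiry time.Time
        data        map[string]string // the applied state machine
    }

    // leasedRead answers from the local state machine while the lease holds;
    // otherwise it must re-confirm leadership with a majority (or redirect).
    func (s *Server) leasedRead(key string) (string, error) {
        s.mu.Lock()
        defer s.mu.Unlock()

        if s.isLeader && time.Now().Before(s.leaseExpiry) {
            // No other node can have become leader (and accepted newer
            // writes) while this lease is unexpired, so a local read is safe.
            v, ok := s.data[key]
            if !ok {
                return "", errors.New("key not found")
            }
            return v, nil
        }
        return "", errors.New("lease expired: confirm leadership via a majority round before reading")
    }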


Thanks for the post!

Another great blog post series about implementing Raft in Go that I found is this one: https://eli.thegreenplace.net/2020/implementing-raft-part-0-...


Yes I definitely referred to his posts and project while working on this. I'm a big fan of his blog!


I highly recommend MIT open courseware 6.824. Incredibly valuable for learning distributed systems, and one of the lab assignments is implementing raft in Go. http://nil.csail.mit.edu/6.824/2022/schedule.html

There are a ton of fascinating and potentially frustrating edge cases and gotchas to implementing raft correctly. There's no better way to understand it than actually implementing it, and I probably never would have done it myself without these course materials.


Unfortunately, I hit a roadblock while implementing the Raft assignment. I knew it was simply beyond my capabilities, but I would have made it through if I had had anyone I could reach out to. I second this recommendation, but make sure you know what you are signing up for. This course is as hard as they come.


I have found the performance tests very tricky to get to pass without any input from others. The assignment is really very unforgiving; I would wager the test suite is comparable to how commercial Raft implementations are tested (e.g. https://github.com/hashicorp/raft).


Big fan of Phil!

A lot of us tinkered like him, but very few came back and wrote up the findings nicely in an easy-to-read form.


Glad to hear it! :)


So kinda like "build your own consul"


I'm pretty sure Consul builds on top of HashiCorp's Raft implementation, so I'd say the end result is similar, yes, but this post goes into the Raft implementation itself.

And I'm sure consul does plenty my silly key-value state machine doesn't.

I'm not super familiar with it.


Reminded me of etcd, which is exactly this - Raft-based key-value store written in Go.

https://github.com/etcd-io/etcd


epic



