The Raft Consensus Algorithm (2015) (raft.github.io)
343 points by oumua_don17 7 months ago | 76 comments



Maelstrom [1], a workbench for learning distributed systems from the creator of Jepsen, includes a simple (model-checked) implementation of Raft and an excellent tutorial on implementing it.

Raft is a simple algorithm, but as others have noted, the original paper includes many correctness details that are often glossed over in toy implementations. Furthermore, the fallibility of real-world hardware (handling memory/disk corruption and grey failures), the requirements of real-world systems with tight latency SLAs, and the need for things like flexible quorums and dynamic cluster membership make implementing it for production a long and daunting task. The commit histories of etcd and hashicorp/raft, likely the two most battle-tested open-source implementations of Raft, still surface correctness bugs on the regular, which tells you all you need to know.

The TigerBeetle team talks in detail [2] about the real-world aspects of distributed systems on imperfect hardware/non-abstracted system models, and why they chose Viewstamped Replication, which predates Paxos but looks more like Raft.

[1]: https://github.com/jepsen-io/maelstrom/

[2]: https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/DE...


> [Viewstamped replication] predates Paxos but looks more like Raft.

Heidi Howard and Richard Mortier’s paper[1] on the topic of Paxos vs Raft has (multi-decree) Paxos and Raft written out in a way that makes it clear that they are very, very close. I’m very far from knowing what consequences (if any) this has for the implementation concerns you state, but the paper is lovely and I wanted to plug it. (There was also a presentation[2], but IMO the text works better when you want to refer back and forth.)

[1] https://doi.org/10.1145/3380787.3393681

[2] https://www.youtube.com/watch?v=0K6kt39wyH0


The Viewstamped Replication paper was surprisingly readable - I'd never looked at consensus algorithms before in my life, and I found I could kind of follow it after a couple of reads.

https://dspace.mit.edu/bitstream/handle/1721.1/71763/MIT-CSA...


Don't forget heterogeneous network topologies and the fact that some members make absolutely terrible leaders.


Looks like a great playground to get familiar with distributed systems.


I had a fun time recently implementing Raft leader election and log replication (i.e. I didn't get to snapshotting/checkpointing). One of the most challenging projects I've tried to do.

The Raft paper is very gentle to read, and gives you a great intuition on its own. Even if you don't want to implement it, you probably use software that uses it: like etcd or consul or cockroach or tidb, etc.

I collected all the resources I found useful while doing it here: https://github.com/eatonphil/goraft#references. This includes Diego Ongaro's thesis and his TLA+ spec.

Some people say Figure 2 of the Raft paper has everything you need, but I'm pretty sure that's just not true. To me it's a bit more vague than the TLA+ spec, anyway.


This looks great. However, without a truly comprehensive test suite it’s guaranteed to have a great many subtle bugs :). I recommend trying to hook it up with the tests from MIT’s distributed systems course: https://pdos.csail.mit.edu/6.824/labs/lab-raft.html.

It looks as though it would take only minor refactoring, at least for leader election and log replication.


I had to implement Raft for a network programming course during my bachelors and I had the same experience regarding how gentle the paper was. Especially for people new to distributed algorithms, I can really recommend it.

My implementation is probably not that great, but I put it online anyway if anyone is interested: https://github.com/skowalak/fastapi-raft/


Figure 2 is great, but I would agree the entire paper is needed if you are implementing raft. There are a few specifics in the paper that you need when implementing it.

P.S. love your blog :)


Cheers!

> Figure 2 is great, but I would agree the entire paper is needed if you are implementing raft. There are a few specifics in the paper that you need when implementing it.

It was more than that. I'm blanking on what it was but there were parts where I really couldn't find anything about the intended behavior in the paper (let alone in Figure 2) except for in Diego's thesis or in the TLA+ spec.

Though maybe I was just not reading the paper correctly.


If you're interested in consensus algorithms, you might be interested in this book that I used in a theoretical course on distributed system called "Reasoning about Knowledge" (https://mitpress.mit.edu/9780262562003/reasoning-about-knowl...).

You have to invest a bit in learning about modal logic, but once you do get past that part, this book provides proofs of why things like Raft or Paxos work that are super intuitive and straightforward. Basically pushing the complexity of proving these algorithms into the logic structure used to form the proof (in an intuitive way). Highly recommend, changed how I think about consensus!


Can't resist posting this (from Classic Programmer Paintings)

“Raft Consensus Algorithm Failure”, Théodore Géricault, 1819

https://classicprogrammerpaintings.com/post/6141087496359280...


Raft has a TLA+ proof, which should make it "bug free". I understand your post is a joke, but I wonder if there are actually any errors.


Actually, AFAIU the TLA+ proof is only for a few small cluster sizes - not for all sizes. And the number of nodes in the painting is definitely above that checked by TLA+...


If anyone else doesn't understand what the visualisation is supposed to show, note that you can click on one of the nodes and make them fail. Particularly try this with the current "leader" (the thing that's sending and receiving all the packets). Press the little pause icon next to the first slider to turn it back into a clock and resume the simulation.

Has anyone else figured out what the spreadsheet on the right is? It looks broken to me (though I thought the same of the rest of the simulation before understanding that it only shows the happy path by default), as it always remains empty. The clickable elements I discovered so far are the two sliders, the clock/pause icon, and the individual servers.


It shows the state of each replica's log. Click on the leader and select 'request' to simulate sending a command. If you take a replica offline, you can see that it falls behind and then gets caught up when it comes back.


Ah, thanks, though I pushed that button a couple times (also before your comment) but it didn't/doesn't do anything that I can tell.


The last time I was working in a setting where a rock-solid Chubby-alike under serious load was constantly top of mind was some years ago, and at that time you used ZK if failure wasn’t an option.

But AFAIK people have been putting heavy, heavy work on Raft-based options like etcd and Consul and others for many years now.

Is one of those systems like the new best default? Certainly the conceptual clarity and elegance of Raft seem like things that would show up in performance and reliability, but I’m just dated on this.

What are people (who aren’t at Google or married to GCP) using as the best practices default when the stakes are high in 2023?

I think there’s a production-grade Rust implementation of Raft from, IIRC, TiKV, and a rock-solid, high-performance lock server seems squarely in the sweet spot for Rust. Are people using that?


Here is their answer to their own question - "What is Raft?"

>Raft is a consensus algorithm that is designed to be easy to understand. It's equivalent to Paxos in fault-tolerance and performance. The difference is that it's decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems. We hope Raft will make consensus available to a wider audience, and that this wider audience will be able to develop a variety of higher quality consensus-based systems than are available today.

After reading that I still have no idea. They are not alone in doing this, but I think it's a shame that people don't spend the extra time and effort to properly describe their work.


Arguably it's a good description because it automatically filters out audiences who don't know what it means; if you're building a distributed system and looking at different consensus algorithms then it's a simple and clear explanation, but if you aren't then it's not a relevant algorithm for you anyway!

The general class of consensus algorithms are for trying to solve the problem of what to do when you have multiple replicas of a data store across several physical devices, when one or many of the devices or their connections fail in some fashion. They are titled "consensus" because the machines need to come to a consensus for what decision to make about a piece of data when a failure event occurs.

For example, you have three servers all replicating the same SQL database:

(A) - (B) - (C)

The network connection linking (C) to the other two drops; (A) and (B) are notified and (B) is promoted to be the primary (it receives the writes and then distributes them to the replicas).

However, (C) doesn't know what's happened, and continues to receive some writes. The network connection is restored, and now (A) and (B) and (C) need to decide what to do. (B) and (C) have both independently received a different set of writes, and the servers need to come to a consensus on what to do with the data.

This is what Raft, Paxos etc. are attempting to solve in a consistent and performant fashion.


This isn't really what Raft is used for: conflict resolution is a completely different area of distributed systems (and, unless you have specifically designed your application for this, the situation you describe, where C continues to accept writes after the rest of the system has reached a new consensus, would be considered a 'split-brain' failure, the kind of thing which tends to get you roasted by Jepsen tests). Raft is used to avoid the conflicts in the first place: the idea is for A, B, and C to agree on some state, or to know that no consensus has been reached (or at least, that if it has been reached, they do not know it and therefore should not act on that information). So in your example it would be used to make sure that A, B, and C agree on the primary initially, and, when C is not able to communicate with the others, that it does not accept writes. (And in general, high-volume applications like most databases do not run consensus on each write, but instead use consensus as part of their overall strategy for consistency, e.g. by deciding on a leader node.)
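
To make that concrete, here is a minimal Go sketch (names are made up, not taken from etcd or hashicorp/raft) of the term check that prevents the scenario above: a node refuses requests from a leader with a stale term, which is how a partitioned ex-leader finds out it has been superseded instead of continuing to accept writes.

```go
package main

import "fmt"

// AppendArgs is a stripped-down stand-in for Raft's AppendEntries request:
// just the sender's term and identity.
type AppendArgs struct {
	Term     int
	LeaderID string
}

// Follower tracks the highest term it has seen so far.
type Follower struct {
	CurrentTerm int
	VotedFor    string
}

// HandleAppend rejects requests from leaders whose term is older than ours.
// This is how a stale leader (like C above) learns it has been superseded.
func (f *Follower) HandleAppend(a AppendArgs) bool {
	if a.Term < f.CurrentTerm {
		return false // stale leader: refuse, so it knows to step down
	}
	if a.Term > f.CurrentTerm {
		f.CurrentTerm = a.Term // adopt the newer term
		f.VotedFor = ""
	}
	return true
}

func main() {
	f := &Follower{CurrentTerm: 5}
	fmt.Println(f.HandleAppend(AppendArgs{Term: 4, LeaderID: "C"})) // false: stale term
	fmt.Println(f.HandleAppend(AppendArgs{Term: 6, LeaderID: "B"})) // true: newer term
}
```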


> Arguably it's a good description because it automatically filters out audiences who don't know what it means; if you're building a distributed system and looking at different consensus algorithms then it's a simple and clear explanation, but if you aren't then it's not a relevant algorithm for you anyway!

Somewhat discourteous to the 99% of people who don't fall into the target audience - you're leaving it to them to thrash around until they work out for themselves they're at the wrong bar.


Did you try reading the next paragraph ("Hold on—what is consensus?")?

Sometimes concepts are too "big" to be introduced to someone with no background in a single paragraph; such is life. However, the linked article is a great intro if you manage to keep reading.


Consensus protocol researcher here. For what it’s worth, I think that the plethora of blockchain research in the last 10 years has made consensus much easier to understand. Raft (in particular, with all of its subtleties) reads (and implements) like Greek in comparison.

For someone new to consensus protocols today, I would start them with Bitcoin, then move on to Paxos/Tendermint/Simplex, and skip Raft entirely. (Simplex is my paper, a simplified version of PBFT.)


Not a consensus protocol researcher.

It seems like blockchain brings a bunch of extra stuff, complexity and cost? Raft seems comparatively simple to me.

What does blockchain bring that's easier to maintain and harder to fuck up than "elect a leader and replicate a log"?


A blockchain (without all the bells and whistles) is definitionally equivalent to a state-machine or log-replication system (but handling Byzantine faults). (Each log entry is just a block.)


I love this site. When I was learning & implementing raft in my distributed systems course, this page was invaluable. Plus the paper itself is pretty easy to read.


I teach a class on distributed systems, and this site is one of the references. Thanks for making Raft so clear!


Are there consensus algorithms that don't require changes go through a leader? In many distributed systems, you want to distribute intake as well.


There are algos which prioritize high fault tolerance and do so by increasing the number of leaders (or the minimum number of nodes that must have a copy of the lookup data).

One such algo is Chord:

https://en.m.wikipedia.org/wiki/Chord_(peer-to-peer)

It’s a peer-to-peer ring of nodes which have their values consistently hashed between them. The network leverages these things called “finger tables”, which essentially store routing information in the form of a table. This table can have information which is incorrect or outdated, and the peer you go to can tell you to go to another peer (usually the “next”/“successor”) until you find the value (or don’t).

The reason this algo can be used with no “leader” is that it can also work by just going to a node and doing a linear scan across all nodes. You don’t need a finger table to speed up queries.
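
For illustration, a toy Go sketch of the ring lookup idea (this is consistent hashing with a linear successor search, not Chord's actual finger-table protocol; all names are made up): keys and node IDs hash onto the same ring, and a key belongs to the first node at or after its position.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// position hashes an arbitrary string onto a 32-bit ring.
func position(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// Ring is a toy consistent-hash ring: node positions kept sorted.
type Ring struct {
	nodes []string
	pos   []uint32
}

func NewRing(nodes []string) *Ring {
	r := &Ring{nodes: nodes}
	sort.Slice(r.nodes, func(i, j int) bool {
		return position(r.nodes[i]) < position(r.nodes[j])
	})
	for _, n := range r.nodes {
		r.pos = append(r.pos, position(n))
	}
	return r
}

// Successor returns the first node clockwise from the key's position.
// In Chord, finger tables let a node skip around the ring instead of
// scanning linearly, but the answer is the same.
func (r *Ring) Successor(key string) string {
	p := position(key)
	i := sort.Search(len(r.pos), func(i int) bool { return r.pos[i] >= p })
	if i == len(r.pos) {
		i = 0 // wrap around the ring
	}
	return r.nodes[i]
}

func main() {
	ring := NewRing([]string{"node-a", "node-b", "node-c"})
	fmt.Println(ring.Successor("user:42")) // whichever node owns this key
}
```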


Leaderless protocols exist in the literature (EPaxos, Caesar, Tempo, ...). These systems usually have a "leader" per transaction, but the leader role is uncontested in normal operation, as it is self-declared by the coordinator of the transaction, so an election only takes place if the coordinator of that transaction fails for whatever reason. Multiple leaders may exist concurrently for any given datum/key the operation involves.

Cassandra is currently developing a leaderless protocol in this vein called Accord. In fact, Cassandra is already using a leaderless protocol for its LWTs; an optimised variant of classic (single-decree) Paxos, but this has significant overheads when competing transactions are declared for the same key at the same time.


Electing a leader and sending changes through the leader simplifies the system and improves throughput and predictability in the presence of contention.

If your transactions are on independent topics, you can distribute the load by sharding leaders: assign ranges of the key space to different leaders and manipulate the elections so each node has a reasonable share of leadership.

You can go leaderless and structure each write as essentially an election: broadcast the tentative transaction (or a request to transact, if the transaction is large enough) to all nodes; if you get a quorum of acceptance, you win and can commit the transaction. But if multiple nodes attempt transactions at nearly the same time, consensus may be time-consuming. If you have many nodes, and they all have pending transactions for the same topic, electing a leader and sending all transactions through the leader is going to be a lot faster than establishing consensus on every transaction individually.
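
A rough Go sketch of that "each write is its own election" idea (the acceptance functions stand in for real RPCs and real per-node conflict rules, which a real protocol would need): the proposer commits only if a majority of nodes accept the tentative transaction.

```go
package main

import "fmt"

// acceptFn stands in for a node's acceptance rule; a real protocol would
// reject proposals that conflict with something already accepted.
type acceptFn func(txn string) bool

// proposeOnce broadcasts a tentative transaction to all nodes and commits
// only if a quorum (strict majority) accepts it.
func proposeOnce(txn string, nodes []acceptFn) bool {
	acks := 0
	for _, accept := range nodes {
		if accept(txn) {
			acks++
		}
	}
	return acks >= len(nodes)/2+1
}

func main() {
	alwaysYes := func(string) bool { return true }
	alwaysNo := func(string) bool { return false }

	// 2 of 3 accept: quorum reached, the write can commit.
	fmt.Println(proposeOnce("set x=1", []acceptFn{alwaysYes, alwaysYes, alwaysNo}))
	// 1 of 3 accept: no quorum, the proposer must retry (and may lose to a competitor).
	fmt.Println(proposeOnce("set x=2", []acceptFn{alwaysYes, alwaysNo, alwaysNo}))
}
```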


Or more succinctly - having a quorum agree on a leader to serialize writes through is both less work than trying to get a quorum to agree on every individual write, and just as consistent.


Paxos doesn't favour any node above any other node.

That said, if you're trying to distribute for the purpose of throughput, then it might be more efficient to only need one leader rather than the quorum that Paxos requires. Only speculating though.

Paxos is also more efficient if calls go to the same place each time - you avoid contention and revotes, etc.


There's a sharded paxos with a shardmaster. Essentially each shard has its own paxos consensus.


I think in this case there's no longer a 'consensus' - since any node can accept any write immediately.

Look into CRDTs, which are essentially algebraic laws for systems that do this and are guaranteed to eventually converge.


This paper[1] discusses "termination detection", on which consensus can be built directly with a trivial proof. It assumes a single environment, and while mathematically that is a kind of graph node, it really maps to all outside sources of input rather than a cluster member.

[1] https://www.cs.utexas.edu/users/EWD/transcriptions/EWD06xx/E...


Not really; there are algorithms where many nodes make proposals in parallel, but eventually only one of those proposals gets finalized (or they maintain only a partial ordering, not a total ordering). (See Bullshark and Redbelly, IIRC.)

The main issue is that someone has got to pick a final ordering. If there’s a universe of millions of possible orderings it seems that the most efficient way is to have a single node pick a single one.


Many systems have lots of leaders to handle concurrency. Ours has millions of them for instance.


The original Dynamo paper (not DynamoDB; its predecessor) didn't use leader election. It used quorums with some speed hacks (hinted handoffs and sloppy quorums).


Blockchains


No, those only allow 1 leader and 1 intake at a time. They just pick the leader by proof of waste first.


I think you may be confusing proof of work with blockchain as a general principle. (I'm not a fan of this cryptocurrency tech anymore either, my comment is purely about the technical properties which can be interesting to study.)


No, GP is right. Without Proof of Work, blockchains have no leader - think of a Git repo with multiple branches: which one is main? Nothing in the blockchain itself determines this; only we, humans, chose to name one of them "main" (formerly "master").


> Without Proof of Work blockchains have no leader

But, proof of stake?


Proof-of-stake blockchains certainly have leaders. MEV doesn't exist if there aren't designated block proposers.


When I was studying the Raft algorithm a year and a half ago, I found this video on it by John Ousterhout to be a good complement:

Designing for Understandability: The Raft Consensus Algorithm

https://www.youtube.com/watch?v=vYp4LYbnnW8


> Raft is a consensus algorithm that is designed to be easy to understand.

> Consensus typically arises in the context of replicated state machines, a general approach to building fault-tolerant systems.

I recognize that I'm not the intended audience but I do think I would be a lot more capable of understanding this article if it used less jargon or at least defined what it meant. I'm only mentioning this because ease of understanding is an explicit goal.

Can someone give a real world example of where this would be used in a production app? I'm sure it's very practical but I'm getting caught up in trying to understand what it's saying


Consensus algorithms underlie distributed databases like Cassandra, or distributed caches like Redis (in fact, Redis does use a version of Raft).

Simple use case, you're running a database across three servers, so that users can ping any one of the servers and it still works (fault tolerance).

User X pings server A to write something to a table. User Y pings server B to read from that table.

How do you make sure that User Y reads what User X wrote, so that User X and User Y are on the same page with what happened?

That's the consensus algo


Not to be too pedantic, but the step of actually shipping those changes across servers is usually outside the scope of consensus algorithms. Generally they are limited to picking a single server to act as a short-term leader. That server then is responsible for managing the data itself.

Though you can conceive of a system where all data flows through the consensus algorithm, practically speaking that would introduce significant overhead at a granularity where it isn't adding value.

There isn't necessarily one dictator for the whole cluster; rather, they are usually scoped to each domain that requires serialization.


Simplest terms - your app needs several different services, or instances of a service, to agree upon a value. There are a lot of reasons you can't use things like the system clock to agree on when something happened, for example; this is where Raft steps in.

You'll see "fault-tolerant" and "replicated state machines" often alongside them. Let's break those down in this context.

For "fault-tolerance" - think production environments where I need to plan for hardware failure. If one of my services goes down, I want to be able to continue operating - so we run a few copies of the app, and when one goes down, another operating instance will step up.

In that case - how do we pick what's in charge? How do all copies agree on things while everything is working smoothly? Raft.

For "replicated state machines" - let's stay in this world of fault-tolerance, where we have multiple instances of our app running. In each service, could reside a state machine. The state machine promises to take any ordered series of events, and always arrive at the same value. Meaning - if all of our instances get the same events in the same order, they will all have the same state. Determinism.

This is where it all comes together, and why I think the jargon becomes tightly coupled to an "easy to understand" definition.

You will reach for replicated state machines when you need deterministic state across multiple service instances. But the replicated state machines need a way to agree on order and messages received. That's the contract - if you give everything all the messages in the same order, everything will be in the same state.

But how do we agree on order? How do we agree on which messages were actually received? Just because "Client A" sends messages "1" and "2" in a specific order does not guarantee they are delivered at all, let alone in that order.

Raft creates "consensus" around these values. It allows the copies to settle on which messages were actually received and when.

So, you could use other approaches to manage "all your service copies getting along" but a replicated state machine is a nice approach. That replicated state machine architecture needs some way to agree on order, and Raft is a great choice for that.
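
To make the "same messages, same order, same state" contract concrete, here is a tiny Go sketch (the key/value commands are just an illustration, not any particular system): two replicas that apply the identical agreed-upon log end up in the identical state.

```go
package main

import "fmt"

// Command is one entry of the agreed-upon log, e.g. "set x to 3".
type Command struct {
	Key   string
	Value int
}

// StateMachine is deterministic: its state depends only on the sequence of
// commands applied, never on timing or on which replica runs it.
type StateMachine struct {
	state map[string]int
}

func NewStateMachine() *StateMachine {
	return &StateMachine{state: map[string]int{}}
}

func (sm *StateMachine) Apply(c Command) {
	sm.state[c.Key] = c.Value
}

func main() {
	// The log that consensus (e.g. Raft) has agreed on, in this exact order.
	log := []Command{{"x", 1}, {"y", 2}, {"x", 3}}

	// Two "replicas" applying the same log independently...
	a, b := NewStateMachine(), NewStateMachine()
	for _, c := range log {
		a.Apply(c)
		b.Apply(c)
	}
	// ...end up in the same state: x=3, y=2 on both.
	fmt.Println(a.state, b.state)
}
```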


Wikipedia has a nice list of software using the algorithm: https://en.wikipedia.org/wiki/Raft_(algorithm)#Production_us...


Apparently there is no reference to Docker in swarm mode, where it is used to decide which node should be the leader.


Think databases which run across many different machines.

Distributed databases are often conceptually modeled as a state machine. Writes are then mutations on the state machine.

With a starting state (an empty database), if everyone agrees on a fixed list of mutations which are executed in a rigidly defined ordering, you will get the same final state.

Which makes sense, right? If you run the following commands on an empty database, you would expect the same final state:

1. CREATE TABLE FOO (Columns = A, B)
2. INSERT INTO FOO (1, 2)
3. INSERT INTO FOO (3, 4)

Which would be:

```
FOO:
|A|B|
|-|-|
|1|2|
|3|4|
```

So where does "consensus" come into play? Consensus is needed to determine `mutation 4`. If the user can send a request to HOST 1 saying 'Mutation 4. should be `INSERT INTO FOO (5, 6)`' then HOST 1 will need to coordinate together with all of the other hosts and hopefully all of the hosts can agree that this is the 4th mutation and then enact that change on their local replica.

This ordering of mutations is called the transaction log.

So, why is this such a hard problem? Most of the reasons are in [Fallacies of distributed computing](https://en.wikipedia.org/wiki/Fallacies_of_distributed_compu...) but the tl;dr is that everything in distributed computing is hard because hardware is unreliable and anything that can go wrong will go wrong. Also, because multiple things can be happening at the same time in multiple places so it's hard to figure out who came first, etc.

Raft is one such algorithm: it lets all of these hosts coordinate in a fault-tolerant way to figure out the 4th mutation.

Disclaimer: The above is just one use of Raft. Another way Raft is used in distributed databases is as a mechanism for the hosts to coordinate a hierarchy of communication among themselves; when a host in the hierarchy is having problems, Raft can be used again to figure out another hierarchy. (Think: consensus is reached on leader election to improve throughput.)
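
To make the "agree on mutation 4" step concrete, here is a hedged Go sketch loosely modeled on Raft's AppendEntries consistency check (names and structure are made up, not any real implementation): a host accepts a new entry at a given index only if its log already matches the leader's view of the entry just before it.

```go
package main

import "fmt"

// Entry is one mutation in the transaction log, tagged with the term of the
// leader that created it.
type Entry struct {
	Term    int
	Command string
}

// appendAt tries to place a new entry at position index (1-based).
// It succeeds only if this host's log already matches the leader's view of
// the preceding entry; otherwise the leader must back up and resend earlier
// entries first.
func appendAt(log []Entry, index int, prevTerm int, e Entry) ([]Entry, bool) {
	prev := index - 1
	if prev > len(log) {
		return log, false // we're missing earlier mutations
	}
	if prev > 0 && log[prev-1].Term != prevTerm {
		return log, false // our history diverges from the leader's
	}
	// Truncate any conflicting suffix and append the new entry.
	return append(log[:prev], e), true
}

func main() {
	log := []Entry{
		{1, "CREATE TABLE FOO"},
		{1, "INSERT INTO FOO (1, 2)"},
		{1, "INSERT INTO FOO (3, 4)"},
	}
	// The leader proposes mutation 4; our entry 3 matches, so we accept it.
	log, ok := appendAt(log, 4, 1, Entry{2, "INSERT INTO FOO (5, 6)"})
	fmt.Println(ok, len(log)) // true 4
}
```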


I made a similar comment to this. If it is for a wider audience, you would think it would be beneficial to explain it in a way that the wider audience understands.


I've written a whole SQLite replication system that works on top of Raft ( https://github.com/maxpert/marmot ). The best part is that Raft has a well-understood and strong library ecosystem. I started off with libraries, and when I noticed I was reimplementing distributed streams, I just took an off-the-shelf implementation (https://docs.nats.io/nats-concepts/jetstream) and embedded it in the system. I love the simplicity and reasoning that comes with Raft. However, I am playing with EPaxos these days (https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf), because then I can truly decentralize the implementation for a masterless design. Right now I've added a sharding mechanism on various streams so that in high-load cases masters can be distributed across nodes too.


It's a shame many undergraduate CS curricula are allergic to distributed systems and type systems.

Even the grad program I was looking at is hot dog water.

I've been playing with Raft and Paxos. Employers will not care, as these were learned out-of-band from degree mills.


Is Raft byzantine-fault-tolerant though? Can it be made so?

Paxos can: http://en.wikipedia.org/wiki/Paxos_(computer_science)#Byzant...


It assumes no malicious agents. If you're malicious you can take control over everything with a single message (advertise being the leader of the next term).


Raft cannot handle Byzantine faults by itself. There are variant algorithms that can.


Raft is not byzantine fault tolerant.



Paxos is a family of algorithms of which Raft is a member. You can change Raft in various ways but then it's not Raft anymore, but it can still be Paxos. So of course Paxos can "do more".


No, Raft is mostly for learning the basics of distributed consensus afaik.


I took a crack at finding the interesting past related threads. Any others?

(I've left out posts about particular implementations, extensions, and so on—there are too many. The intention is threads about the algorithm itself.)

Raft Is So Fetch: The Raft Consensus Algorithm Explained Through Mean Girls - https://news.ycombinator.com/item?id=33071069 - Oct 2022 (53 comments)

Raft Consensus Animated (2014) - https://news.ycombinator.com/item?id=32484584 - Aug 2022 (67 comments)

Why use Paxos instead of Raft? - https://news.ycombinator.com/item?id=32467962 - Aug 2022 (45 comments)

In Search of an Understandable Consensus Algorithm (2014) [pdf] - https://news.ycombinator.com/item?id=29837995 - Jan 2022 (12 comments)

Raft Consensus Protocol - https://news.ycombinator.com/item?id=29079079 - Nov 2021 (51 comments)

Paxos vs. Raft: Have we reached consensus on distributed consensus? - https://news.ycombinator.com/item?id=27831576 - July 2021 (48 comments)

Raft Visualization - https://news.ycombinator.com/item?id=25326645 - Dec 2020 (35 comments)

Raft: A Fantastical and Absurd Exploration - https://news.ycombinator.com/item?id=23129707 - May 2020 (1 comment)

Understanding Raft Consensus - https://news.ycombinator.com/item?id=23128787 - May 2020 (3 comments)

In Search of an Understandable Consensus Algorithm (2014) [pdf] - https://news.ycombinator.com/item?id=23113419 - May 2020 (26 comments)

Paxos vs. Raft: Have we reached consensus on distributed consensus? - https://news.ycombinator.com/item?id=22994420 - April 2020 (65 comments)

Raft Is So Fetch: The Raft Consensus Algorithm Explained Through Mean Girls - https://news.ycombinator.com/item?id=22520040 - March 2020 (4 comments)

Implementing Raft: Part 2: Commands and Log Replication - https://news.ycombinator.com/item?id=22451959 - Feb 2020 (16 comments)

Building a Large-Scale Distributed Storage System Based on Raft - https://news.ycombinator.com/item?id=21447528 - Nov 2019 (5 comments)

In Search of an Understandable Consensus Algorithm [pdf] - https://news.ycombinator.com/item?id=14724883 - July 2017 (14 comments)

Instructors' Guide to Raft - https://news.ycombinator.com/item?id=11300428 - March 2016 (3 comments)

Fuzzing Raft for Fun and Publication - https://news.ycombinator.com/item?id=10432062 - Oct 2015 (10 comments)

Prove Raft Correct - https://news.ycombinator.com/item?id=10017549 - Aug 2015 (27 comments)

Scaling Raft - https://news.ycombinator.com/item?id=9725094 - June 2015 (12 comments)

Raft Consensus Algorithm - https://news.ycombinator.com/item?id=9613493 - May 2015 (24 comments)

Creator of Raft is speaking at our meetup. What questions do you want answered? - https://news.ycombinator.com/item?id=9351794 - April 2015 (6 comments)

Replicating SQLite using Raft Consensus - https://news.ycombinator.com/item?id=9092110 - Feb 2015 (21 comments)

Raft Refloated: Do We Have Consensus? [pdf] - https://news.ycombinator.com/item?id=9015085 - Feb 2015 (4 comments)

Analysis of Raft Consensus [pdf] - https://news.ycombinator.com/item?id=8736868 - Dec 2014 (3 comments)

The Raft Consensus Algorithm - https://news.ycombinator.com/item?id=8527440 - Oct 2014 (27 comments)

Raft: Understandable Distributed Consensus - https://news.ycombinator.com/item?id=8271957 - Sept 2014 (79 comments)

Raft - The Understandable Distributed Protocol - https://news.ycombinator.com/item?id=6859101 - Dec 2013 (10 comments)

Raft, a scrutable successor to Paxos - https://news.ycombinator.com/item?id=5624627 - April 2013 (2 comments)


Anyone using object storage like S3 for cluster coordination/election instead?


Not sure about S3. IIRC they relatively recently implemented consistent writes.

But this article explains how to "piggyback" on stores providing consistent writes in order to implement your own little leader election algo on top of a foundation layer:

https://cloud.google.com/blog/topics/developers-practitioner...
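
A rough Go sketch of that piggybacking pattern (the Store interface here is hypothetical, not S3's or GCS's actual API): a node becomes leader by conditionally writing a lease record via compare-and-swap, and everyone else backs off until the lease expires.

```go
package main

import (
	"fmt"
	"time"
)

// Store stands in for any storage that offers an atomic compare-and-swap on a
// single object (conditional writes, generation numbers, etc.). This interface
// is made up for the sketch; it is not a real S3/GCS API.
type Store interface {
	// CompareAndSwap writes newVal only if the key currently holds oldVal,
	// reporting whether the swap happened.
	CompareAndSwap(key, oldVal, newVal string) bool
}

// Lease is what gets serialized into the lock object.
type Lease struct {
	Holder  string
	Expires time.Time
}

func encode(l Lease) string {
	return fmt.Sprintf("%s|%d", l.Holder, l.Expires.UnixNano())
}

// tryAcquire attempts to take (or renew) the leader lease for `me`.
// It only overwrites a lease that has expired or that we already hold.
func tryAcquire(s Store, me string, current Lease, ttl time.Duration) (Lease, bool) {
	if time.Now().Before(current.Expires) && current.Holder != me {
		return current, false // someone else holds a live lease
	}
	want := Lease{Holder: me, Expires: time.Now().Add(ttl)}
	if !s.CompareAndSwap("leader-lease", encode(current), encode(want)) {
		return current, false // lost the race to another candidate
	}
	return want, true
}

// memStore is an in-memory Store used only to make the sketch runnable.
type memStore struct{ val string }

func (m *memStore) CompareAndSwap(_, oldVal, newVal string) bool {
	if m.val != oldVal {
		return false
	}
	m.val = newVal
	return true
}

func main() {
	// Start with an already-expired lease held by nobody.
	expired := Lease{Holder: "nobody", Expires: time.Unix(0, 0)}
	s := &memStore{val: encode(expired)}
	lease, ok := tryAcquire(s, "node-a", expired, 10*time.Second)
	fmt.Println(ok, lease.Holder) // true node-a
}
```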


I want to say that jgroups has an S3 coordination method. I came across quite a few methods while trying to set up a Keycloak cluster in Docker Swarm, but ultimately the Kubernetes DNS method worked best for my case (Swarm seems close enough that I was able to use it with no problem). I'd also note that my understanding is that Swarm uses the Raft algorithm.


I think S3 is too slow to put in your control plane’s critical path; better to use something like Redis.


Obligatory plug: http://nil.csail.mit.edu/6.824/2022/schedule.html

MIT Distributed Systems course where Raft is implemented as a class assignment (assignment 2). The test suite around the assignment is incredibly valuable and !will! find bugs in your Raft implementation. The assignment is broken up into distinct steps to help people not get stuck when doing everything at once. It is still very challenging to implement everything, especially to get the performance tests to pass.


[flagged]


> ??? That isn't easy to understand

Distributed systems aren't simple, but compared to Paxos, Raft is definitely on the simpler side. I think they also have a specific audience in mind (distributed systems engineers), rather than the general public, when they write "it's easy to understand", which makes sense.


ELI5


* 1 computer will eventually break (or one service will crash, or be replaced). To keep processing requests, you need more than 1 computer.

* Two-or-more computers can be fed conflicting information.

* Consensus algorithms prevent such conflicts from being accepted by the system.

* Raft is one such algorithm. It operates by choosing a leader and then routing all requests via that leader, so that all observers see the same sequence of requests.
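
As a toy illustration of the "choosing a leader" step, here is a Go sketch (far simpler than real Raft's RequestVote, which also checks terms and log freshness; names are made up): a candidate wins only with a strict majority, and since any two majorities must overlap, two leaders cannot be chosen in the same round.

```go
package main

import "fmt"

// countVotes tallies votes for a candidate; in real Raft each peer also
// checks the candidate's term and how up-to-date its log is before voting.
func countVotes(votes []bool) int {
	n := 0
	for _, v := range votes {
		if v {
			n++
		}
	}
	return n
}

// wonElection reports whether the candidate got a strict majority of the
// cluster (including itself). Any two majorities share at least one node,
// which is why two leaders can't both win the same round.
func wonElection(votes []bool, clusterSize int) bool {
	return countVotes(votes) >= clusterSize/2+1
}

func main() {
	// 5-node cluster: the candidate votes for itself and hears back from peers.
	fmt.Println(wonElection([]bool{true, true, true, false, false}, 5))  // true
	fmt.Println(wonElection([]bool{true, true, false, false, false}, 5)) // false
}
```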


You need to at least try.


You need to, too.




