To snapshot, I either need to find an epoch such that no older epoch could ever have relevant information for the current node, or I need the nodes to agree on a snapshot. Both seem complicated.
See also her PhD dissertation, "Distributed consensus revised":
As one of the core authors of the Apache Kudu Raft implementation <http://kudu.apache.org/> (which is written in C++), I know that we tried to design it to be a pretty standalone subsystem, but we didn't try to actually provide a libraft per se. We wanted to reuse the Raft write-ahead log as the database write-ahead log (as a performance optimization), which is one reason that making the log API completely generic eluded us a little.
That said, I'm currently at Facebook helping to adapt that implementation for another database. We are trying to make it database agnostic, and we keep finding cases where we need some extra metadata from the storage engine, new kinds of callbacks, or hacks to deal with cases that just work differently than the Kudu storage engine. It would likely take anybody several real-world integrations to get the APIs right (I'm hopeful that we eventually will :)
What kind of metadata do you need exactly from the datastore?
1. Kudu uses a special "commit" record that goes into the log for crash consistency, related to storage engine buffer flushes, so we need an API to write those into the log. They don't have a term or an index, since they are a local-engine thing, so they have to be skipped when replicating data to other nodes when the current node is the leader. If we were not sharing the log with the engine, we wouldn't need this.
2. Another database I'm working with requires file format information to be written at the top of every log segment, and it has to match the version of the log events following it. That info has to be communicated to the follower up front, even when the follower resumes replicating from the middle of the leader's log. So we need plugin callbacks on both sides: packing this into the wire protocol metadata as the leader, and unpacking it as a follower.
Requirements like these will come up and either you hack around them by making some kind of out-of-band call (not ideal for multiple reasons) or you bake the capability into the plugin APIs and the communication protocol.
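To make that concrete, the kind of plugin surface these requirements push you toward looks roughly like this (a Go sketch with made-up names; it's not Kudu's API or anyone else's, just an illustration of the two hooks above):

    package raftlog

    // Entry is a single record in the shared write-ahead log.
    type Entry struct {
        Term    uint64
        Index   uint64
        Payload []byte
        // Local marks engine-only records (e.g. the "commit" markers tied to
        // buffer flushes). They carry no term/index and must never be shipped
        // to followers.
        Local bool
    }

    // EngineHooks is what the storage engine would plug into the consensus layer.
    type EngineHooks interface {
        // SegmentHeader returns the file-format/version metadata that must be
        // sent to a follower before any entries from this segment, even when
        // the follower resumes from the middle of the leader's log.
        SegmentHeader(firstIndex uint64) ([]byte, error)

        // ApplySegmentHeader lets the follower validate and record that
        // metadata before appending the entries that follow it.
        ApplySegmentHeader(header []byte) error
    }

    // entriesToReplicate filters out engine-local records before handing a
    // batch to the transport; only the leader needs this.
    func entriesToReplicate(batch []Entry) []Entry {
        out := make([]Entry, 0, len(batch))
        for _, e := range batch {
            if e.Local {
                continue // commit markers stay on this node
            }
            out = append(out, e)
        }
        return out
    }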
Frankly, designing generic APIs is also one of the less sexy aspects to consider because we spend so much of our time dreaming about and building all the cool distributed systems capabilities like leader elections, dynamic membership changes, flexible quorums, proxying/forwarding messages, rack/region awareness, etc etc etc. :)
The details of long-tail stuff like this are often hammered out as they come up during implementation.
At this point, would they even offer anything superior to etcd-raft? Asking from a position of honest ignorance.
I'm actually kinda wondering why this is news now -- the author of this paper is absolutely right, of course, but I thought this was common knowledge for anyone who knew what Raft was. Raft is less robust than Paxos but simpler to implement (and implement correctly), which is why projects choose it. Basically: chat (run a Paxos round) to decide a leader, and build the log (the history of everything that has ever happened in the system; see WALs for a similar concept) at the chosen leader, instead of chatting about every single change or batch of changes.
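Stripped of everything that makes it actually work (terms, quorum acks, commit indexes, retries), the "build the log at the leader" idea is basically this toy shape (illustrative Go with made-up names):

    package toyraft

    // LogEntry is one slot in the leader's log.
    type LogEntry struct {
        Index   uint64
        Command []byte
    }

    // Leader is the only node that assigns log positions; followers just copy.
    type Leader struct {
        log       []LogEntry
        followers []chan LogEntry // stand-in for the real RPC transport
    }

    // Propose appends a client command at the next index and ships it out.
    // A real implementation would attach a term, wait for a majority of acks,
    // and only then report the entry as committed.
    func (l *Leader) Propose(cmd []byte) uint64 {
        e := LogEntry{Index: uint64(len(l.log)) + 1, Command: cmd}
        l.log = append(l.log, e) // the leader alone decides the order
        for _, f := range l.followers {
            f <- e
        }
        return e.Index
    }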
- Compartmentalized Paxos (https://mwhittaker.github.io/publications/compartmentalized_...) shows how you can compartmentalize the various roles of a Paxos node to increase scalability.
- Matchmaker Paxos (https://mwhittaker.github.io/publications/matchmaker_paxos.p...) introduces a separate set of nodes called matchmakers which are used during reconfiguration.
- PigPaxos (https://arxiv.org/abs/2003.07760) places the nodes into relay groups and revises the communication pattern to improve scalability. This seems very similar to Compartmentalized Paxos.
- Linearizable Quorum Reads in Paxos (https://www.usenix.org/system/files/hotstorage19-paper-chara...) shows how you can do linearizable quorum reads in Paxos.
In the original Paxos paper (https://lamport.azurewebsites.net/pubs/paxos-simple.pdf), part 3 (implementing a replicated state machine, RSM) is not explained in much detail. There are countless ways to use Paxos to implement an RSM. Multi-Paxos, Raft, and EPaxos all try to fill in that gap.
In any case, Paxos itself is 10x simpler than Raft or whatever. Every time I hear a "distributed systems" engineer say Paxos is complicated, I know they don't have much experience in the field, or at least have never implemented the core consensus part...
EDIT: For others, here's a very comprehensive (as of ~2018) review of Paxos-related distributed consensus algorithms with an exposition for each one: https://vadosware.io/post/paxosmon-gotta-concensus-them-all/ That's 17 in all, excluding the original Paxos paper. IMO, it should be linked anywhere Paxos is discussed. The link has been posted twice before by others on HN, but unfortunately hasn't seen any discussion, perhaps because it speaks for itself.
Break the RSM problem into three stages:
1. log replication (say there are m approaches)
2. log consistency (n approaches)
3. log execution (k approaches)
Then you will have m * n * k ways of achieving your goal, based on different requirements at each of the three stages.
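To illustrate the m * n * k point: if each stage sits behind its own interface, a choice at one stage can in principle be combined with any choice at the others. Sketch only; the interface names are made up.

    package rsm

    // Entry is one agreed-upon command in the log.
    type Entry struct {
        Index   uint64
        Command []byte
    }

    // Stage 1: how entries reach other nodes (leader push, relay groups,
    // chain replication, ...).
    type Replicator interface {
        Replicate(e Entry) error
    }

    // Stage 2: when an entry counts as durable/agreed (majority quorum,
    // flexible quorums, leases, ...).
    type Committer interface {
        Committed(index uint64) bool
    }

    // Stage 3: how agreed entries are applied to the state machine (serial,
    // batched, speculative, ...).
    type Executor interface {
        Apply(e Entry) error
    }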
I've implemented the consensus part of Paxos in 10 minutes. But it's a toy. It's totally useless without the other stuff.
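For reference, that 10-minute toy is more or less the acceptor half of single-decree Paxos, something like the following (no networking, no persistence, no proposer; purely illustrative):

    package paxos

    // Acceptor is the acceptor role of single-decree Paxos.
    type Acceptor struct {
        promised  uint64 // highest proposal number promised so far
        acceptedN uint64 // proposal number of the accepted value, 0 if none
        acceptedV []byte // the accepted value, nil if none
    }

    // Prepare handles phase 1. If n is higher than anything promised so far,
    // the acceptor promises to ignore lower-numbered proposals and reports
    // any value it has already accepted.
    func (a *Acceptor) Prepare(n uint64) (ok bool, prevN uint64, prevV []byte) {
        if n <= a.promised {
            return false, 0, nil
        }
        a.promised = n
        return true, a.acceptedN, a.acceptedV
    }

    // Accept handles phase 2. The proposal is accepted unless the acceptor
    // has since promised a higher-numbered proposal.
    func (a *Acceptor) Accept(n uint64, v []byte) bool {
        if n < a.promised {
            return false
        }
        a.promised = n
        a.acceptedN = n
        a.acceptedV = v
        return true
    }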
There are two possibilities:
1. You misunderstood Paxos, probably confusing it with RSM. For example, Paxos is essentially just the leader election part of Raft; superficially that part looks different from Paxos, but that's what it is under the surface.
2. You implemented RSM incorrectly, probably missing some important features or optimizations: for example, log compaction, membership reconfiguration, pipelining, back pressure on execution, etc.
It is VERY important to differentiate Paxos and RSM. There are tons of optimizations you can do with RSM. But for consensus itself, there is ONLY Paxos today. Either you are doing it wrong or you have invented something truly new.
1. You underestimate my understanding.
2. You underestimate my experience.
I am sure you know this: both are comparable as a shared log abstraction. Once you have that, the state machine part is trivial. For the replicated shared log abstraction, you need to do the same kinds of things in both the Paxos and Raft worlds. The latter is simpler to implement because the paper is written close to an engineer's understanding. I have written several versions of both, in production.
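Concretely, once a shared log abstraction exists, the state machine part is little more than an apply loop like this (a sketch; "SharedLog" and "StateMachine" are stand-in interfaces, not any real library's API):

    package apply

    // SharedLog is the replicated log abstraction the consensus layer provides.
    type SharedLog interface {
        // Next blocks until the entry at index is committed and returns it.
        Next(index uint64) ([]byte, error)
    }

    // StateMachine is the application being replicated.
    type StateMachine interface {
        Apply(cmd []byte) error
    }

    // Run applies committed entries in log order, which is all the "state
    // machine part" amounts to once the log itself is agreed upon.
    func Run(log SharedLog, sm StateMachine) error {
        for i := uint64(1); ; i++ {
            cmd, err := log.Next(i)
            if err != nil {
                return err
            }
            if err := sm.Apply(cmd); err != nil {
                return err
            }
        }
    }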
I think the difference in our stances arises from the two ways in which the term 'Paxos' is used. In the part-time-parliament paper, single-decree Paxos is the fundamental building block. I'm assuming this is what you think of as Paxos, and you are right in your assertion: it is defined as the fundamental act of consensus.
However, I have always viewed the single-decree protocol as a pedagogical device; a single write-once distributed register is useless in practice, so the multi-decree protocol is where it becomes useful, and to me is the point of the paper. I say this for a few reasons: (a) state machine replication has been Lamport's focus since his early papers, including the 'Time, Clocks, and the Ordering of Events' paper; (b) in engineering-oriented papers such as 'Paxos Made Live', the Chubby paper, and so on, the term is used as a proxy for a replicated log. The many variants of Paxos exist because of the underlying impulse that basic Paxos is not sufficient in and of itself: you need some variant of a shared log or, equivalently, atomic broadcast.
So, we are both right in our ways. We can have consensus, as long as we sacrifice consistency of the definition of 'Paxos' :)
You can build a toy implementation in a weekend if you have a cookbook. A production-ready implementation takes a bit more.
Things like testing against high-latency, unreliable, or jittery connections are maybe out of scope for an undergrad.
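One cheap way to get partway there in tests is to wrap whatever transport the implementation uses and inject delay and loss, roughly like this (a sketch; "Transport" is a stand-in for your own send abstraction):

    package flakynet

    import (
        "math/rand"
        "time"
    )

    // Transport is whatever send abstraction the implementation under test uses.
    type Transport interface {
        Send(to string, msg []byte) error
    }

    // Flaky drops a fraction of messages and delays the rest by a random
    // amount, which is enough to shake out many ordering and timeout bugs.
    type Flaky struct {
        Inner    Transport
        DropRate float64       // e.g. 0.1 drops roughly 10% of messages
        MaxDelay time.Duration // e.g. 500 * time.Millisecond
    }

    func (f *Flaky) Send(to string, msg []byte) error {
        if rand.Float64() < f.DropRate {
            return nil // silently lost, like a real network
        }
        if f.MaxDelay > 0 {
            time.Sleep(time.Duration(rand.Int63n(int64(f.MaxDelay))))
        }
        return f.Inner.Send(to, msg)
    }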
This happens every once in a while on HN: someone mentions having done one of these assignments, and immediately gets tackled for it.
Maybe professors aren’t doing a good job conveying the limitations. But also this community is gratuitously hostile to people who have no reason to doubt that the code they wrote from the Raft paper, which passed the test suite, was Raft.
I really hope you read this gently. (As the HN guidelines say, "Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.") I'm not trying to talk down to you or treat you with hostility (and I know that it's really hard to convey that via text). I would just ask that when someone who has professional, real-world experience in something says it is difficult and time-consuming to do it right, you'd avoid assuming they just don't know what they're doing, and that perhaps there are aspects that you haven't considered.
And hey, maybe you or some of the other commenters are just ridiculously smart and focused and can write it in a weekend. But if that's the case, it's pretty uncharitable to push a narrative that it's trivial. Not saying that's what's happening here, but that could be how it's coming off.
This is a proportionate response to an undergraduate trying to sell you his new RDBMS. But most students really did write a b-tree. Academic programming elides the supporting infrastructure that bridges the gap between algorithm and software system. Textbooks aren't generally leaving out 100 pages of extra steps required for the list to actually be sorted or the path to actually be shortest. And so a student is not exactly out of line for thinking that the Raft he was taught is actually consistent in the presence of failure. I'll defer to the community's wisdom that he's wrong! But he's still not out of line.
If anything, I'm worried about these classes instilling false confidence. People who think they know these algorithms may go implement them professionally, and not have anyone around to tell them the full story. Cryptography education is careful to put asterisks around "Textbook RSA." Distributed systems education should probably be doing the same.
Part of why the Raft paper is so excellent is that it does leave you feeling like you could explain and implement the algorithm. I don't want to discourage people from being excited about these ideas, because I am too.
That being said, I am generally frustrated by the lack of humility that many software engineers exhibit. "Easy" is a trigger word for me, and I really think it is something that should be expunged from most of our vocabulary when referencing software.
If you feel confident you've built a complex algorithm like this correctly on the first try you probably haven't. Hubris and distributed systems just don't mix.
Verifying correctness to back up our empirical claims is often the hardest and most overlooked part of software engineering.
This is almost independent of how hard it is to build a (good) pure implementation of Raft that offers a simpler API.
I used to follow Jepsen https://jepsen.io/ closely. Those tests are quite comprehensive, but I think even they are not sufficient to quantify the quality of a Paxos/Raft implementation.
Tools like TLA+ probably require a few months to learn and become proficient with.
Paxos vs Raft: Have we reached consensus on distributed consensus? — Heidi Howard
But anyway, maybe this could be applied to helping people work together cooperatively.
In practice it just seems most efficient to be tolerant of consensus failures and focus on cAP (the availability and partition-tolerance side of CAP).