
Creator of Raft Algorithm introduces LogCabin - dencold
https://sourcegraph.com/blog/118135010314
======
mavelikara
I have a slightly off-topic question for Diego and other people experienced
with distributed systems.

Why are consensus algorithms always developed as systems, not as libraries?
ZooKeeper, etcd, and LogCabin all operate as a cluster of processes which other
nodes connect to over a client library.

I can imagine the distributed state-machine-replication mechanism of Raft or
ZAB being implemented as a library where the user has to provide an
implementation of the communication layer. Such a library can be used as a
starting point for building other more complex systems which aim to provide a
low-friction install experience. For example, one good thing about both
Cassandra and ElasticSearch is that they both have homogeneous clusters where
all nodes play the same role. Incidentally, from what I understand, they both
embed a consensus implementation within.

Similarly, a membership service (and failure detector) over gossip protocols
will also be very useful.

An installation guide which starts with "First, install a ZooKeeper cluster.
Then, install a SWARM cluster. Then ..." is not very appealing. That being the
case, I wonder _why_ there is no mature OSS library which provides these
services. What does HN think about this situation?

~~~
ongardie
This question came up at the CoreOS Fest earlier today too. I think everyone
wants libraries as well as services, and I'd like that too for LogCabin one
day. It's just a bit harder when you start thinking about the details. It's
not impossible though. As one example, CockroachDB is using etcd's Raft
implementation as a library.

Part of the issue is that we want to have libraries with small simple
interfaces, and a consensus library is probably going to be on the large side,
as it has to interface with the disk, the network, and the state machine, plus
membership changes, log compaction, who's the leader, configuration settings,
debug logging, etc.

Another issue is that, as a library, it has to be in the language you're
using. And if that language is C++ or similar, it has to be
compatible/convenient with whatever threading approach you're using.

Then there's performance/memory. Some Raft implementations are designed to
keep relatively small amounts of data in memory (like LogCabin or etcd), and
others are meant to store large amounts on disk (like HydraBase). Some are
optimized more for performance, others for safety.

I think we'll get libraries eventually. Keep in mind Raft is still very young.
Paxos is around 26 years old; Raft is only 0.5 to 3 years old (depending on
when you start counting). I like to think that Raft lowered the cost of
developing a consensus system/library significantly, but it still takes time
to develop mature implementations. Right now we have a lot of implementations
in early stages; I wonder if some of these efforts will consolidate over time
into really flexible libraries.

~~~
mavelikara
Thanks Diego! From your comment and the sibling comments, what I gather is
that the idea of a library is not fundamentally flawed; it is just that the
state of the art has not matured to that level.

------
misframer
I'm not a C++ programmer; I work mainly with Go. I'm curious whether it is
common for C++ developers to implement their own event loops for network
transports, as Diego has done here [0]. The other example I know of is
Replicant [1], which is used by HyperDex, and it uses a custom event loop too
[2].

[0] [https://github.com/logcabin/logcabin/tree/master/Event](https://github.com/logcabin/logcabin/tree/master/Event)

[1] [https://github.com/rescrv/Replicant](https://github.com/rescrv/Replicant)

[2] [https://github.com/rescrv/busybee](https://github.com/rescrv/busybee)

~~~
ongardie
LogCabin uses its event loop for network operations but then hands requests
off to threads to process. I started out with libevent2, but the problem is it
doesn't deal with having multiple threads very well (error-prone and
inefficient). It's also not as well-documented as the man pages for epoll, so
I ended up using epoll directly instead. What was really lost in the libevent2
-> epoll conversion was platform independence, but I think it might be better
to get that back through well-placed #ifdefs or relying on some other library;
I wouldn't go back to libevent2.

~~~
justinmk
> What was really lost in the libevent2 -> epoll conversion was platform
> independence

Did you consider libuv?

~~~
ongardie
I guess the answer is no, I haven't really looked at it. Anyone have
experience with it in a large-ish project outside of Node?

~~~
apaprocki
Yes -- it is very straightforward to use and "just works". (It is also used in
a similar way as found in Node in the luvit project:
[https://github.com/luvit/luvit](https://github.com/luvit/luvit))

------
windlep
First thoughts in my head....

1) Ah, cool, creator of the Raft algorithm, so some of the 'obvious' mistakes
in an implementation should've been resolved by now (though if people weren't
trying to use it in production... who knows).

2) Great, C++, it should be efficient and fast with consistent RAM usage (Go's
GC is still a bit... eh).

3) Oh, you need a C++ client library. :(

I would love to say that APIs don't matter, but they do. So, so, so much. If
they didn't, etcd would never have had a chance against ZooKeeper. The
ZooKeeper folks are looking at adding RESTful APIs to allow etcd-style
functionality, because it's obvious that a convenient API is a huge win. Any
distributed systems solution attempting to gain steam should consider this
from the beginning, as the bar has been set.

~~~
ongardie
I think I'd have to agree with you now: REST APIs seem to help with adoption.
LogCabin was initially created for use with RAMCloud (
[http://ramcloud.stanford.edu](http://ramcloud.stanford.edu) ), which mostly
hand-rolls its RPC serialization to achieve its extreme performance goals (it
budgets about 1 microsecond in software overhead per RPC). I thought I was
being user-friendly in LogCabin by using protobufs, and at the time, something
as embarrassingly slow as HTTP+JSON was unthinkable. I don't think it's too
late to add a REST API to LogCabin, and it'd be pretty easy to make a REST
proxy. Maybe that's worth doing for easier adoption from other languages. I
also think the CLI client makes the barrier to entry pretty low for people
that just want to test it out.

~~~
hueving
Tons of people are super comfortable with HTTP and have tooling built around
testing and debugging it, even if the performance is crappier.

It's an ugly cousin of premature optimization that you have to deal with if
you're producing APIs for your software. If the API requires anything that
isn't braindead simple, you are going to lose out to competitors due to the
learning curve.

------
rxaxm
diego, thanks so much for raft! i'm a student in the brown class you shout
out, and i can testify as to its relative simplicity and the clarity with
which you guys communicate the ins and outs.

i have a question for you, though. why is raft not concerned with byzantine
failure? the focus on byzantine fault tolerance from the paxos family of algos
(and a lot of the literature/educational material on distributed consensus)
makes me feel like it's important, but your approach suggests it perhaps
isn't. do you think this focus is a side-effect of the ubiquity of paxos which
is disproportionately concerned with this due to its roots in academia?

~~~
ongardie
It's a good question, and I don't really know where the community as a whole
sits on Byzantine vs non-Byzantine. A few thoughts:

Byzantine is more complex, and most people in industry aren't doing it: there
are a lot of Byzantine papers out there but few real-world implementations. I
think Byzantine is important for uses where the nodes really can't be trusted
for security reasons, and maybe there are easier fault-tolerance payoffs
elsewhere when the entire system is within one trust domain such as a single
company.

Byzantine consensus is slower and requires more servers.

If you don't have independent implementations running on each of your servers,
the same software bug could still take out your entire cluster. You get some
benefit if the hardware fails independently, but you don't get protection from
correlated software outages. Maybe the difficulty in characterizing which
faults a particular deployment can handle makes it harder to sell to
management.

With Raft, we were just trying to solve non-Byzantine consensus in a way
people could understand, and we think it's still a useful thing to study even
if your ultimate goal is Byzantine consensus. You might be interested in
Tangaroa, from the CS244b class at Stanford, where Christopher Copeland and
Hongxia Zhong did some work towards a Byzantine version of Raft [1][2] and
Heidi Howard's blog post on it [3]. But really, Castro and Liskov's PBFT is a
must read here [4].

[1] [http://www.scs.stanford.edu/14au-cs244b/labs/projects/copeland_zhong.pdf](http://www.scs.stanford.edu/14au-cs244b/labs/projects/copeland_zhong.pdf)

[2] [https://github.com/chrisnc/tangaroa](https://github.com/chrisnc/tangaroa)

[3] [http://hh360.user.srcf.net/blog/2015/04/conservative-election-in-byzantine-fault-tolerant-raft/](http://hh360.user.srcf.net/blog/2015/04/conservative-election-in-byzantine-fault-tolerant-raft/)

[4] [http://pmg.csail.mit.edu/papers/osdi99.pdf](http://pmg.csail.mit.edu/papers/osdi99.pdf)

------
siscia
What is the purpose of such a system?

Why do people need something like ZooKeeper or LogCabin?

How does a coordinator come into play?

I don't know much about distributed systems, but I would love to learn more...

~~~
ongardie
You end up needing consensus for a lot of fault-tolerant systems that need to
provide consistent results. For example, if your system allows users to choose
their own usernames, and you're trying to guarantee that all usernames are
unique, and your system needs to automatically deal with server failures, then
I think you also need consensus.

Another way to think about it is that consensus gets you the equivalent of a
compare-and-swap operation in a distributed setting. Just as compare-and-swap
is useful for building synchronization primitives with shared memory,
consensus is useful for building synchronization primitives across a network.

[1] [https://en.wikipedia.org/wiki/Compare-and-swap](https://en.wikipedia.org/wiki/Compare-and-swap)

------
cperciva
I always wonder about cute names like this... is LogCabin so named because
it's HouseBoat minus C? Or is it instead the integral of 1/Cabin dCabin?

~~~
btown
Almost certainly has something to do with
[https://en.wikipedia.org/wiki/Timber_rafting](https://en.wikipedia.org/wiki/Timber_rafting)
:)

~~~
ongardie
I've certainly been to that page before :)

I come from an academic lineage of log-based projects, from log-structured
filesystems [1] which structure disks as a log, to RAMCloud [2][3][4] whose
durability/recovery aspects are a distributed and partially in-memory
extension of that, to Raft and LogCabin that are built around the concept of a
replicated log for consensus.

LogCabin used to export a log-oriented data model, by the way, where the name
made even more sense. There was some talk of renaming it to TreeHouse now that
it exports a key-value tree, but that one didn't really catch on.

[1] [https://web.stanford.edu/~ouster/cgi-bin/papers/lfs.pdf](https://web.stanford.edu/~ouster/cgi-bin/papers/lfs.pdf)

[2] [http://ramcloud.stanford.edu](http://ramcloud.stanford.edu)

[3] [https://www.usenix.org/conference/fast14/technical-sessions/presentation/rumble](https://www.usenix.org/conference/fast14/technical-sessions/presentation/rumble)

[4] [https://web.stanford.edu/~ouster/cgi-bin/papers/RumblePhd.pdf](https://web.stanford.edu/~ouster/cgi-bin/papers/RumblePhd.pdf)

------
halayli
This is a very inefficient way to handle events.

[https://github.com/logcabin/logcabin/blob/master/Event/Loop....](https://github.com/logcabin/logcabin/blob/master/Event/Loop.cc#L144-L154)

An event can be read or write, and each fd:event_type pair should be able to
map to a different Event::File.

In its current form, if you receive 2000 events, you'll need to unnecessarily
context switch 2000 times.

~~~
ongardie
Maybe, but I don't think it matters much. The comment there explains the
issue, and I vaguely remember my measurements showing that this wasn't a big
deal.

I do think it's an interesting (dare I say) limitation of epoll, and something
I'd think about if I was redesigning epoll.

Please file an issue on github if you have ideas on how to improve this
without adding significant complexity.

------
moomin
Reminder that no system is production ready until aphyr has had time to break
it. :)

~~~
ongardie
Yes, good point. I welcome Mr. Kingsbury to give it a spin, if he's so
willing. And though there may be critical bugs left to find, I don't feel like
I've misrepresented the current state of LogCabin.

------
frankchn
Nice work! A Curator-like
([http://curator.apache.org/](http://curator.apache.org/)) framework on top of
LogCabin will make it immensely more useful for everyday production systems
though.

------
natch
What is a coordination service used for?

~~~
danesparza
Generally speaking: Application/service configuration and service discovery.
Another example of this is [http://consul.io](http://consul.io)

------
azinman2
What the webpage doesn't do is explain why I want this over the competitors.
That would be helpful.

~~~
ongardie
In part that's because I don't see other Raft implementations as competitors.
A major goal in Raft was to enable many implementations. It'd be a huge fail
if everyone switched to LogCabin and abandoned all the other Raft
implementations.

Really though, it's probably not a feature list that'll make people want to
use LogCabin. In Scale Computing's case, it's that they have a C++ code base
that they could integrate LogCabin well with, and they know they can work with
and maintain the code if they need to.

------
jbandela1
Just the other day, I was looking for a raft consensus implementation in C++
and was disappointed at not finding any that were production ready (as opposed
to Java and Go). This fits the need perfectly. Also, RAMCloud says it requires
ZooKeeper. Do you have any idea of the timeframe for when RAMCloud will use
LogCabin instead?

~~~
ongardie
Good timing then, and I'd be happy to talk more about whether LogCabin is a
good fit for your use.

RAMCloud used to depend on an earlier version of LogCabin (before the data
model was a key-value tree), then John rewrote the RAMCloud end of that and
switched it to using ZooKeeper, but he made it pluggable on the RAMCloud side.
So we just need an "ExternalStorage" implementation for LogCabin in RAMCloud
to make that work again. It should be relatively straightforward and I'm
confident they'd welcome a well-written and tested patch, but no one has
volunteered yet.

------
theossuary
That video was pretty great, 50:47 made me laugh:
[https://youtu.be/2dfSOFqOhOU?t=3046](https://youtu.be/2dfSOFqOhOU?t=3046) .

------
0xdeadbeefbabe
In Raft, how can a client know it is receiving stale, yet agreed-upon, data?
I'm talking about the edge case of a network partition or some other condition
like it.

~~~
ongardie
Raft implementations can choose to implement reads in various ways. In
LogCabin reads are linearizable, meaning that the results are current as of
sometime after the read request was initiated, and it doesn't rely on bounded
clock drift to make this guarantee. That's about the best you can do in terms
of freshness in a distributed system. Other implementations might offer weaker
read semantics or give the client the option. Check out 6.3-6.4 in my
dissertation for a ton more detail:
[https://github.com/ongardie/dissertation#readme](https://github.com/ongardie/dissertation#readme)

~~~
0xdeadbeefbabe
I'm still not understanding what happens when a network partition creates a
two-leader scenario. It's shown in a slide here
[http://thesecretlivesofdata.com/raft/](http://thesecretlivesofdata.com/raft/)
under "log replication" after creating a network partition that separates CDE
from AB. Doesn't that mean a client of node B could read stale data? Or maybe
an implementation like etcd would let the client choose to read or not in that
case?

~~~
ongardie
If the network is split into AB and CDE, then only CDE will be able to reach a
quorum. If a client is on the AB side of that split, no consensus algorithm
could guarantee freshness on a read. LogCabin clients will continually retry
until they can talk to a functioning leader (and clients can put a timeout on
that) so that you get a guaranteed fresh read every time; other systems could
choose to return a stale result right away instead. It's up to the
implementation, but I strongly recommend people not do stale reads by default.

------
alexnewman
Congrats, how does compaction work? I assume you can serve kvs from memory?

~~~
alexnewman
Oh and can I call this from c easily?

~~~
ongardie
There's not currently an extern "C" wrapper, but I don't think that'd be too
much work to write.

