Kafka brokers handle both computation (partition/topic management, sequencing, assignments, etc.) and storage. This coupling creates scaling and operational challenges, which LogDevice avoids by separating the two layers. Storage nodes can be as simple as object stores (optimized for appending to files), and each piece of data is written to multiple non-deterministic locations to randomize placement. The nodes read, write, and recover data very quickly by working together in a mesh.
Meanwhile the compute layer becomes very lightweight and almost stateless, which makes it easy to scale. In LogDevice, the sequencers are the potential bottleneck, but generating a series of incrementing numbers is about the fastest thing a server can do, so a sequencer will outpace any realistic data ingest to a single log while giving you a total order of all entries within that log. The sequence numbers (LSNs) follow the Hi/Lo pattern: if a sequencer fails, another takes its place with a greater "High" component, which guarantees that all of its LSNs will be greater than those of the previous sequencer. This also provides a built-in buffer: if a sequencer fails, messages can still be accepted and assigned their permanent LSNs after recovery.
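A minimal sketch of that Hi/Lo split (toy code, not LogDevice source; the exact bit widths here are an assumption for illustration):

    # Toy model of an LSN: a "High" epoch that only changes on sequencer
    # failover, and a "Low" counter the active sequencer increments.
    EPOCH_BITS = 32  # assumed split for illustration

    def make_lsn(epoch: int, seq: int) -> int:
        return (epoch << EPOCH_BITS) | seq

    # Sequencer in epoch 7 hands out LSNs...
    a = make_lsn(7, 41)
    b = make_lsn(7, 42)

    # ...it fails; its replacement is activated with a higher epoch, so
    # every LSN it issues is greater than anything the old sequencer
    # issued, no matter how far the old counter had advanced.
    c = make_lsn(8, 0)
    assert a < b < c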
Apache Pulsar is similar to LogDevice but goes further: brokers manage connections, routing, and message acknowledgements, while data is sent to a separate layer of Apache BookKeeper nodes, which store it in append-optimized log files.
I worked on LogDevice at FB until about 6 months ago.
I'm not that familiar with Kafka, but in general LogDevice emphasizes write availability over read availability. There are many applications where data is being generated all the time, and if you don't write it, it will be lost. However, if reading is delayed, it just means readers are a little behind and will need to catch up.
So, when a sequencer node dies and we need to figure out what happened to the records that were in flight -- which ones ended up on disk & can be replicated, what the last record was -- LogDevice still accepts new writes. However, to ensure ordering, these new writes aren't visible to readers until the earlier writes are sorted out.
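A toy model of that visibility rule (assumed mechanics, not LogDevice code; the `Log` class and its fields are hypothetical): appends in the new epoch succeed immediately, but the point up to which readers may read doesn't advance past them until recovery of the old epoch finishes:

    # Toy model: writes are always accepted, but records are only
    # "released" to readers once the earlier in-flight records are
    # sorted out.
    class Log:
        def __init__(self):
            self.records = {}        # lsn -> payload
            self.last_released = 0   # readers see only lsn <= this
            self.recovering = False  # true while an old epoch is sorted out

        def append(self, lsn, payload):
            self.records[lsn] = payload  # accepted regardless
            if not self.recovering:
                self.last_released = max(self.last_released, lsn)

        def finish_recovery(self):
            # earlier writes are sorted out; release everything buffered
            self.recovering = False
            if self.records:
                self.last_released = max(self.records)

        def read(self):
            return [self.records[lsn] for lsn in sorted(self.records)
                    if lsn <= self.last_released]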
What happens with a sequencer which appears to fail but hasn't really, and then comes back up after the process of figuring out in-flight records has completed? If that sequencer receives a record for a log, will it be able to write it to the storage nodes? I.e. is there any fencing mechanism to tell the storage nodes that the epoch has been bumped, so they don't accept writes for that epoch anymore?
Yes, in LogDevice it's called "sealing". However, as it stands, a newly activated sequencer won't wait for sealing of the old epoch to complete before taking new writes - in the tradeoff between write availability and consistency, LogDevice picks higher availability. Blocking new writes until sealing is complete, however, should be fairly easy to integrate into LogDevice as an option.
Does it block reads until sealing is complete? How many nodes in the nodeset have to respond before sealing is complete? |NodeSet| - ReplicationFactor + 1?
Yes, reads are not released (i.e. are blocked) until sealing is complete. We call the minimal set of nodes sufficient to serve reads for a log (the same set is needed for sealing to complete) an f-majority.
For a simple case where placement of data is location-agnostic, indeed the definition of f-majority is n - r + 1, where n is the nodeset size, and r is the replication factor.
However, if your replication property is, say, "place 3 copies across 3 racks", then the definition of f-majority becomes more complicated - e.g. hearing from all nodes in the nodeset minus two racks will also satisfy it.
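To make the simple (location-agnostic) case concrete, here is a sketch of the n - r + 1 formula above; it doesn't cover the rack-aware variant:

    def f_majority(n: int, r: int) -> int:
        # Any n - r + 1 nodes intersect every possible set of r copies,
        # so between them they have seen every fully replicated record.
        return n - r + 1

    # Nodeset of 5, replication factor 3: responses from any 3 nodes
    # suffice, since the 2 silent nodes can hold at most 2 of the 3
    # copies of any record.
    assert f_majority(5, 3) == 3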
> Yes, reads are not released (i.e. are blocked) until sealing is complete.
Which are the cases where consistency is compromised then? If a client of the log needs consistency, it needs to ensure that it has seen all previous updates to a log before making a new update, which implies a read.
> However, if your replication property is, say, "place 3 copies across 3 racks", then the definition of f-majority becomes more complicated - e.g. hearing from all nodes in the nodeset minus two racks will also satisfy it.
Sure, the aim being that once sealing completes, no in-flight write can be acknowledged by enough unsealed replicas to count as successful.
> Which are the cases where consistency is compromised then? If a client of the log needs consistency, it needs to ensure that it has seen all previous updates to a log before making a new update, which implies a read.
Consistency in a more general sense than just read-modify-write consistency. If you have sequencers active in several epochs at the same time accepting writes, the records may end up being written out of order, and there would be a breakage of the total ordering guarantee.
> Consistency in a more general sense than just read-modify-write consistency. If you have sequencers active in several epochs at the same time accepting writes, the records may end up being written out of order, and there would be a breakage of the total ordering guarantee.
But given that reads are blocked until all epochs before the current one are sealed, this should still provide total order atomic broadcast, unless a single client can connect to a sequencer with a lower epoch than one it has already seen.
LogDevice clients do notify sequencers if they have seen newer epochs, which would cause a sequencer reactivation, which indeed resolves the issue within the context of a single client.
However, there can still be reordering in the context of a wider system. E.g. if client A sends a write (w1) to sequencer in epoch X, which gets replicated and acknowledged, and after that client B sends a write (w2) to sequencer in epoch (X-1) which gets replicated and acknowledged (because epoch X-1 is not sealed), then readers eventually will see w2 before w1. If writes in epoch X weren't accepted before the sealing of the epoch (X-1) had completed, this reordering would be impossible, however as a result write availability would suffer.
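As a toy timeline of that scenario (assumed model with hypothetical epoch values, reusing the illustrative LSN layout from earlier): in wall-clock order w1 is acknowledged before w2, but readers consume in LSN order, which puts w2 first:

    EPOCH_BITS = 32  # same illustrative LSN layout as above

    def make_lsn(epoch: int, seq: int) -> int:
        return (epoch << EPOCH_BITS) | seq

    X = 10  # hypothetical epoch number
    w1 = ("w1", make_lsn(X, 0))      # client A, acked first in real time
    w2 = ("w2", make_lsn(X - 1, 5))  # client B, acked later via the
                                     # still-unsealed older epoch

    # Readers deliver records in LSN order, so w2 comes out before w1.
    delivered = sorted([w1, w2], key=lambda rec: rec[1])
    assert [name for name, _ in delivered] == ["w2", "w1"]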
Ok, but for this to be problematic, readers would need to have some other mechanism to know that w1 did actually take place before w2. So FIFO instead of total order.
Anyhow, thanks for answering my questions. Very interesting system.
Ah, I think you may be talking about the repeatable reads property? All readers in LogDevice are guaranteed to see the same records in the same order (aside from trimmed data).
What I was really wondering was whether LogDevice provides total order atomic broadcast, and as such whether it solves consensus. It appears it does (or rather, it daisy-chains on the consensus provided by ZooKeeper and uses its own fencing mechanism, similar to what BookKeeper/Pulsar does).
As lclarkmichalek said, there’s more than one way to skin a cat.
At any rate, I used “equivalent” here to mean that, while different trade-offs have been made, it has the sort of users building the same sort of applications on the same sort of abstractions — it plays the same role, for all intents and purposes.
I’m wondering this as well. In the description it says that it ensures total ordering while Kafka only ensures partition ordering. I haven’t read enough to say more.