Meanwhile the compute layer becomes very lightweight and almost stateless, which makes it easy to scale. In LogDevice, the Sequencers are potential bottlenecks, but generating a series of incrementing numbers is about the fastest thing you can do, so it'll outpace any actual data ingest to a single log while giving you a total order of all entries within that log. The numbers (LSNs) follow the Hi/Lo sequence pattern: if a Sequencer fails, another one takes its place with a greater "high" number, which guarantees that all of its LSNs will be greater than the previous Sequencer's. This also provides a built-in buffer: if a Sequencer fails, the system can still accept messages and assign them permanent LSNs after recovery.
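As a rough sketch of the Hi/Lo idea (the exact LSN layout is LogDevice-internal; the 32/32 bit split and class names below are just illustrative):

```python
class Sequencer:
    """Toy Hi/Lo sequencer: epoch is the "high" part, seq the "low" part."""

    def __init__(self, epoch: int):
        self.epoch = epoch      # "high" part, bumped on failover
        self.next_seq = 1       # "low" part, reset for each new epoch

    def next_lsn(self) -> int:
        # Pack epoch into the high bits so LSNs from a later epoch always
        # compare greater than any LSN from an earlier one.
        lsn = (self.epoch << 32) | self.next_seq
        self.next_seq += 1
        return lsn

# A replacement sequencer gets a strictly greater epoch, so every LSN it
# issues exceeds any LSN the failed sequencer could have issued.
old = Sequencer(epoch=7)
new = Sequencer(epoch=8)
assert new.next_lsn() > old.next_lsn()
```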
Apache Pulsar is similar to LogDevice but goes further: brokers manage connections, routing, and message acknowledgements, while data is sent to a separate layer of Apache BookKeeper nodes, which store it in append-optimized log files.
One question though: will Presto support querying from LogDevice directly? :)
I'm not that familiar with Kafka, but in general LogDevice emphasizes write availability over read availability. There are many applications where data is being generated all the time, and if you don't write it, it will be lost. However, if reading is delayed, it just means readers are a little behind and will need to catch up.
So, when a sequencer node dies and we need to figure out what happened to the records that were in flight -- which ones ended up on disk & can be replicated, what the last record was -- LogDevice still accepts new writes. However, to ensure ordering, these new writes aren't visible to readers until the earlier writes are sorted out.
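A toy model of that visibility gating, where readers only see records up to a "released" point that lags behind accepted writes until recovery finishes (names and structure are illustrative, not the LogDevice implementation):

```python
class Log:
    def __init__(self):
        self.records = {}        # lsn -> payload
        self.last_released = 0   # readers may not read past this LSN

    def append(self, lsn: int, payload: str):
        # New writes are accepted even while recovery is in progress.
        self.records[lsn] = payload

    def release_up_to(self, lsn: int):
        # Called once the fate of earlier in-flight records is settled.
        self.last_released = lsn

    def read_visible(self):
        return [p for l, p in sorted(self.records.items())
                if l <= self.last_released]

log = Log()
log.append(1, "old-epoch")
log.append(2, "new-epoch")              # accepted, but not yet visible
assert log.read_visible() == []
log.release_up_to(2)                    # recovery done; both readable
assert log.read_visible() == ["old-epoch", "new-epoch"]
```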
For a simple case where placement of data is location-agnostic, the definition of f-majority is indeed n - r + 1, where n is the nodeset size and r is the replication factor.
However, if your replication property is, say, "place 3 copies across 3 racks", then the definition of f-majority becomes more complicated - e.g. having all nodes in the nodeset respond except for two racks will also satisfy it.
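Both checks can be sketched like this. The key observation for the rack-aware case is that an f-majority must intersect every possible copyset, and a copyset spanning 3 racks can only be missed entirely if the non-responders themselves span 3 racks (function names and the node/rack layout are illustrative):

```python
def is_f_majority_simple(responded: int, n: int, r: int) -> bool:
    # Location-agnostic: any n - r + 1 responses must intersect
    # every possible copyset of size r.
    return responded >= n - r + 1

def is_f_majority_rack_aware(non_responders, rack_of, racks_per_copyset=3):
    # A copyset places one copy in each of `racks_per_copyset` distinct
    # racks, so responses can miss a copyset only if the non-responding
    # nodes cover at least that many racks.
    racks_down = {rack_of[node] for node in non_responders}
    return len(racks_down) < racks_per_copyset

# 9 nodes spread over 3 racks:
rack_of = {f"n{i}": f"rack{i % 3}" for i in range(9)}

# All nodes in two whole racks fail to respond -> only 3 of 9 responded,
# which fails the simple check but still satisfies the rack-aware one:
down = {node for node, rack in rack_of.items() if rack in ("rack0", "rack1")}
assert not is_f_majority_simple(responded=3, n=9, r=3)
assert is_f_majority_rack_aware(down, rack_of)
```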
Which are the cases where consistency is compromised then? If a client of the log needs consistency, it needs to ensure that it has seen all previous updates to a log before making a new update, which implies a read.
> However, if your replication property is, say, "place 3 copies across 3 racks", then the definition of f-majority becomes more complicated - e.g. having all nodes in the nodeset respond except for two racks will also satisfy it.
Sure, the aim being that no write can be successfully acknowledged by enough replicas to complete the write.
Consistency in a more general sense than just read-modify-write consistency. If you have sequencers active in several epochs at the same time accepting writes, the records may end up being written out of order, breaking the total-ordering guarantee.
But given that reads are blocked on all sequencers before the current one, this should still provide total order atomic broadcast, unless a single client can connect to a sequencer with a lower epoch than one it has already seen.
However, there can still be reordering in the context of a wider system. E.g. if client A sends a write (w1) to the sequencer in epoch X, which gets replicated and acknowledged, and after that client B sends a write (w2) to the sequencer in epoch (X-1), which also gets replicated and acknowledged (because epoch X-1 is not sealed), then readers will eventually see w2 before w1. If writes in epoch X weren't accepted before the sealing of epoch (X-1) had completed, this reordering would be impossible; however, write availability would suffer as a result.
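The scenario can be modeled with a toy simulation: two epochs both accepting writes because the older one was never sealed, and readers delivering in LSN order (epoch numbers, the 32-bit LSN packing, and class names are all illustrative):

```python
class Epoch:
    def __init__(self, num):
        self.num, self.sealed, self.next_seq = num, False, 1

    def append(self, store, payload):
        # Sealing an epoch is what would force old writers to stop;
        # here epoch X-1 is never sealed before epoch X activates.
        if self.sealed:
            raise RuntimeError("epoch sealed, writer must reconnect")
        lsn = (self.num << 32) | self.next_seq
        self.next_seq += 1
        store.append((lsn, payload))

store = []
e9, e10 = Epoch(9), Epoch(10)     # epochs X-1 and X
e10.append(store, "w1")           # client A: acknowledged first in real time
e9.append(store, "w2")            # client B: older epoch still open
# Readers deliver in LSN order, so w2 (epoch 9) sorts before w1 (epoch 10):
assert [p for _, p in sorted(store)] == ["w2", "w1"]
# Sealing epoch 9 before activating epoch 10 would have rejected w2
# instead, at the cost of refusing writes until sealing completed.
```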
Anyhow, thanks for answering my questions. Very interesting system.
Scribe isn’t the only place where LogDevice is used, though: Facebook has documented using it for TAO as well (as part of the secondary indices).
Unless we're talking about two different projects named Scribe, which is certainly possible.
At any rate, I used “equivalent” here to mean that, while different trade-offs have been made, it has the same sort of users building the same sort of applications on the same sort of abstractions; it plays the same role, for all intents and purposes.
It kind of does, on the write pipeline. Tailers vary, but scribed controls the semantics of how your message gets delivered to LogDevice.