The two sound very similar. I've been studying BookKeeper this week to understand how it works and was excited to see the blog post about LogDevice.
My understanding isn't fully there yet, but BookKeeper seems to have a more involved protocol than LogDevice: i.e. it has fencing to ensure that only one writer is writing to a log at a time, a 2-phase-commit-like protocol, and opening/closing of ledgers.
To me, LogDevice's sequencer and epoch number sound much simpler. Does BookKeeper achieve stronger consistency or other guarantees that LogDevice doesn't by having a more involved protocol? How do the two compare in terms of goals, pros/cons, tradeoffs, use cases, etc.?
(From reading the blog post, it seems to me that in LogDevice an old sequencer could still be writing when a new one with a higher epoch number is also writing. As I understand it, BookKeeper uses a CAS operation on the metadata version number (like LogDevice's epoch number) AND fencing on the storage nodes to make sure that only a single writer is writing at a time.)
I'm not familiar with BookKeeper, so I can't comment on that. I do know that, compared to most other systems, LogDevice emphasizes high write availability. So even if we're still sorting out the details of which records made it at the end of one epoch, we'll still take writes for future epochs. We just won't release them to readers until we've made enough copies of earlier records.
We do our best to ensure that only one sequencer is running at a time. We use Zookeeper to store information about the current epoch & sequencer, and whenever a new sequencer starts, it needs to talk to Zookeeper. So that helps with races where several machines want to become the sequencer for a single log at the same time.
Also, when clients are looking for the sequencer for a given log, they all try the same set of machines in the same order. Essentially there's a deterministic shuffling of the list of servers, seeded with the log id. So all clients will try to talk to the same server first, and only if that server is down, or the client learns that another sequencer is active, will it try a different server.
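A rough sketch of that deterministic shuffle (the function name and the use of Python's `random.Random` are my assumptions, not LogDevice's actual implementation):

```python
import random

def sequencer_candidates(servers, log_id):
    # Hypothetical helper: seed a PRNG with the log id so every client
    # computes the exact same permutation of the server list.
    rng = random.Random(log_id)
    order = list(servers)
    rng.shuffle(order)
    return order  # all clients try order[0] first, then fall back

servers = ["node1", "node2", "node3", "node4"]
# Every client derives the identical ordering for the same log id:
assert sequencer_candidates(servers, 42) == sequencer_candidates(servers, 42)
```

Since the seed is the log id, different logs tend to land on different first-choice servers, which also spreads sequencer load across the cluster.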
Hope that helps!
In addition to the general model of log writing + reactive processing, I've also found that we spend a lot of time focused on getting the _type structure_ of our data right (which allows for flexible/generic downstream processes that can also be efficient and type-checked against the structure of the data ahead of time). I wonder if/how this aspect of the problem has been considered by the Facebook team.
How does a consumer that has retrieved N and N+2 know whether N+1 is not yet written, or whether it failed and will never be written? Perhaps writes go through the sequencer, with subsequent writes waiting on acknowledgements, so 'gaps' only occur on epoch changes?
Within an epoch, the sequencer is a single process on a single machine and gives out LSNs sequentially. When a sequencer dies, and a new one is started, its first job is to fix up the end of the last epoch. If it can't find any copies of a given record, it inserts a "hole plug," to store the fact that the record is lost. So, except for hole plugs (which should be very rare), the only gaps are between epochs as you say.
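That recovery step can be sketched roughly as follows (data structures and names are mine; LogDevice's real epoch recovery protocol is more involved):

```python
def recover_epoch_tail(node_records, tail_lsns):
    # node_records: {node_id: set of LSNs that node stores} for the
    # tail of the dead sequencer's epoch.  Any LSN with at least one
    # surviving copy can be re-replicated; an LSN with no copies gets
    # a hole plug, recording that the record is lost, not just delayed.
    outcome = {}
    for lsn in tail_lsns:
        if any(lsn in stored for stored in node_records.values()):
            outcome[lsn] = "record"
        else:
            outcome[lsn] = "hole plug"
    return outcome

# LSN 5 has no surviving copy on any node, so it gets a hole plug:
nodes = {"node1": {3, 4}, "node2": {4, 6}}
assert recover_epoch_tail(nodes, [3, 4, 5, 6])[5] == "hole plug"
```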
Suppose you have 10 LogDevice servers, and you store 3 copies of every record. Then the client will have connections to all 10 machines, and each machine will push whatever records it has to the client. Crucially, the servers always push records in order. So if a client gets record N from machine 7, then gets record N+2, it can be sure that machine 7 doesn't have a copy of record N+1.
Once you've got record N+2 or higher from at least 8 machines, without getting record N+1 from any of them, you can be sure that at least one copy of the record has been lost. If those 8 servers have complete information, you can report to the user that the record has been lost.
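The arithmetic behind that threshold can be sketched as a toy check (not LogDevice code): with 10 servers and a replication factor of 3, 10 - 3 + 1 = 8 servers past the gap is conclusive.

```python
def at_least_one_copy_lost(num_servers, replication, servers_past_gap):
    # Servers push records in order, so a server that delivered N+2
    # without N+1 provably holds no copy of N+1.  Once enough servers
    # have gone past the gap, fewer than `replication` servers remain
    # that could possibly hold a copy, so at least one copy is lost.
    remaining = num_servers - servers_past_gap
    return remaining < replication

assert at_least_one_copy_lost(10, 3, 8) is True   # conclusive
assert at_least_one_copy_lost(10, 3, 7) is False  # 3 servers could still hold it
```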
Can you comment on how writes to the LogDevice servers are performed - particularly if the writes have to go through the sequencer, or if the individual producers can write directly after obtaining a LSN?
Since a given LogDevice server will only receive a non-deterministic sub-sequence of records, I would think it has to receive its set of writes in order? (To have enough information to push to clients in order). Unless there is actually some mechanism by which it can determine the elements it should receive.
Trying to understand if your per-log throughput will be capped by the max traffic that can be pushed through a single sequencer, or the round-trip latency for waiting for acknowledged writes from LogDevice servers.
Yes. You can think of an individual log like a database shard.
We have plans to allow the sequencer to just give out sequence numbers, and allow the clients to send the data directly to the storage nodes. But it's not a high priority for us, since it's just a constant factor improvement. (Although a large one!) Users would still need to shard their data, although not as much.
The very fact that the announcement includes "the ultimate goal of contributing it to the open source community later in 2017" makes the licensing issue pertinent, particularly given Facebook's licensing history.
1. Do the log records pass through the sequencer nodes on the way to their destination, or does the writer issue a request for the sequence number and wait for a response?
2. If you're maintaining the current epoch number (and sequencer location) in Zookeeper, and a sequencer starts going up and down, you're definitely going to have problems, because nodes will get potentially stale epoch/location data from Zookeeper. Unless, that is, you have Zookeeper run a full consensus algorithm for every single query for every single record you write, which would kill performance.
So as long as you're using single sequencer nodes, and dealing with replication, and doing some kind of leader election to pick a new sequencer node when one goes down, why not just run a normal consensus algorithm (like Raft) for storing records? The Raft leader node serves as the sequencer, it takes care of replication, it tolerates failures, it will be no slower than what you're doing now, and it will be way more reliable.
Keeping SSTables makes writes mostly sequential, and reads reasonably sequential too, as long as you have enough RAM to accumulate multiple records of each log so that they form contiguous blocks in the flushed file.
Actually you could get a very similar result using Cassandra, which also uses SSTables. The difference is that Cassandra keeps merging files, which actually generates much more IO traffic than the client writes themselves. Cassandra will typically need 16x more IO for merging than the actual data write rate. You can limit it a bit if you create time-sharded tables.
Can someone explain/expand on the above please? I've read the article a couple of times and tried to understand the above paragraph in context but I don't get it.
A copy set is the list of storage node ids that the sequencer chose as recipients for a record. This piece of metadata is stored on storage nodes alongside record payloads, in an index that maps each sequence number to its copy set.
Storage nodes consult the index to filter out payloads they shall not send to readers. The filtering logic consists of sending the copy only if the storage node sees itself as the first recipient in the copy set after it's been shuffled using the client id as a seed.
This results in readers receiving exactly one copy of the payload.
Edit: This is a good thing
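That election can be sketched like so (names are mine; it assumes the shuffle is any deterministic function of the client id that all recipients compute identically):

```python
import random

def should_send(node_id, copyset, client_id):
    # Every node in the copy set computes the same client-seeded
    # permutation, so exactly one node sees itself first and sends.
    order = list(copyset)
    random.Random(client_id).shuffle(order)
    return order[0] == node_id

copyset = [3, 7, 9]  # hypothetical node ids picked by the sequencer
senders = [n for n in copyset if should_send(n, copyset, client_id=12345)]
assert len(senders) == 1  # the reader receives exactly one copy
```

Seeding with the client id also means different readers draw their single copy from different nodes, spreading read load across the copy set.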
Frankly, if it's the same license as React (BSD + PATENTS), I'm not interested.
Edit: Here come the facebook fanboys with the downvotes.
> don't post off topic
Users of HN, please note that "we" is Facebook, and that Facebook, or a representative of Facebook, is telling us what we can and can't discuss here.
This is why I quit posting here and use /r/hackernews on Reddit for my links. However, I occasionally make an appearance when I see patterns emerge, such as control where we didn't assume control existed.
Why do you decide what we discuss? Since you squashed my comment about licensing, another one popped up. It means people do think licensing is important to discuss.
I believe Facebook was built, in part, using software under MIT, BSD, Apache, etc – I see it as immoral when they turn around and release OSS software under BSD + PATENTS.
BSD + PATENTS sounds like a great idea in theory – if every company released under BSD + PATENTS it would essentially render the patent system toothless, except for the patent trolls who would still be free to sue whomever they want. In practice, however, it would be hard for BSD + PATENTS to have widespread adoption because it would be seen as a loss of value to shareholders, especially as for some companies their patent portfolio is an important part of their valuation.
> Having known so many people involved with Facebook for so long, I have come up with a phrase to describe the cultural phenomenon I’ve witnessed among them – ladder kicking. Basically, people who get a leg up from others, and then do everything in their power to ensure nobody else manages to get there. No, it’s not “human nature” or “how it works.” Silicon Valley and the tech industry at large weren’t built by these sorts of people, and we need to be more active in preventing this mind-virus from spreading.
Are we really allowing them to do this for the sake of convenience?
On top of that, I'm just getting really tired of seeing the patent issue trotted out any and every time Facebook or React are mentioned here.