I think it's fairly simple and might be enough.
Can't comment on storage requirements, though.
I imagine you looked at other solutions before starting this. A distributed log is a fairly simple idea to understand (hard to implement) but what pain point is being solved?
Seeing that it is written in C/C++, could it be that LogDevice is optimised purely for speed and responsiveness?
It seems very similar.
In terms of function, LogDevice is similar to the core of Apache Kafka.
- There is no many-to-many log recovery whereas -- for example in Pulsar/DistributedLog -- logs are stored in small segments and distributed to multiple nodes.
- Read scalability. Since the whole log is stored on one node (with some replicas), readers are bound by a single disk's sequential read capacity. Again, Pulsar stores logs in segments that are distributed among broker nodes, which helps a lot when there are many readers.
All of this is done while preserving the total ordering guarantee, thanks to the separation of sequencing and storage.
The operator could, for example, set a bigger node set size for logs that are known to have multiple consumers and require more IO capacity.
At Facebook, we have use cases where a single consumer will need to replay a backlog of records in a log, sometimes hours or days worth of data, to rebuild its state. We call this a backfill. Node sets allow the IO to be spread across multiple disks, which improves backfill speed and helps reduce hotspots.
-- Adrien from the LogDevice team.
Curious where LogDevice would be a bad decision...
In Kafka, bulk reading is very cheap: the broker basically just calls sendfile() to send a file segment with compressed message chunks. On the other hand, only the leader of a partition can serve requests, so you are often limited by bandwidth. It looks like LogDevice has to do a bit more work server-side, but may be able to read from all servers with a replica.
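For anyone unfamiliar with the trick, here is a minimal sketch of that zero-copy pattern (Kafka itself is JVM code; this just illustrates the syscall, with made-up names):

```python
import os
import socket

def serve_segment(conn: socket.socket, path: str) -> None:
    # Hypothetical helper: stream one log segment file to a client.
    # os.sendfile() asks the kernel to move bytes from the page cache
    # straight to the socket, so the data never enters user space.
    with open(path, "rb") as segment:
        size = os.fstat(segment.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(conn.fileno(), segment.fileno(), offset, size - offset)
            if sent == 0:
                break  # peer closed the connection
            offset += sent
```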
Kafka stores more metadata in the record wrapper, like client and server timestamps and partition key.
There are client libraries for C++ and Python.
Operationally they look similar - both require a Zookeeper cluster, and both require assigning permanent ids to nodes.
It would be interesting to see some benchmarks comparing LogDevice with Kafka and Pulsar. That said, I suspect from the lack of buzz around Pulsar that Kafka isn't a performance bottleneck for most people using it.
Unfortunately it's hard to change something that already works. Most users don't hit the performance limits of their tools, so they'll just continue using Kafka if it's already running.
If anyone from the FB team or anyone using LogDevice wants to test performance with Optane SSDs (and compare to a NAND SSD), make a request by submitting an issue on our GitHub page: https://github.com/AccelerateWithOptane/lab/issues. I'll hook you up with a server hosted by Packet.
Do people just point their system journal at Kafka and wait for something to break?
At my previous job we built something similar to this out of RabbitMQ and MongoDB. I always wondered what the other big log companies used. MongoDB seemed like a pretty good fit, but a pure append-only database might be even better. Trimming performance in MongoDB was subpar, so we worked around it by creating a new collection for each day; trimming became a simple operation of dropping a collection at the end of each day.
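In case it's useful to anyone, the rotation trick looks roughly like this (a toy pymongo sketch; collection names and connection details are made up, not our production code):

```python
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # made-up deployment
db = client["logs"]

def collection_for(day: datetime):
    # One collection per day, e.g. "events_2018_09_12" (hypothetical naming).
    return db[f"events_{day:%Y_%m_%d}"]

def append(record: dict) -> None:
    collection_for(datetime.now(timezone.utc)).insert_one(record)

def trim(retention_days: int) -> None:
    # "Trimming" is just dropping an old day's collection: a cheap
    # metadata operation instead of a slow bulk delete.
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    db.drop_collection(f"events_{cutoff:%Y_%m_%d}")
```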
Kafka can be used as a data store if you like, so long as you're happy with the data management and access patterns it gives you - it is, after all, optimised for large sequential reads.
LogDevice looks to be very similar to Kafka for most use cases; hell, they even use RocksDB, which is used for stateful operations in Kafka Streams, and of course, ZooKeeper.
Where it differs is that it looks like it was designed for you to be able to work against a single "cluster" that could well be running across multiple data-centres. Which is very much a Facebook problem to solve.
So yeah, Kafka was a distributed log built for LinkedIn-sized problems; LogDevice is a distributed log built for Facebook-sized problems.
Most of us don't have Facebook-sized problems.
Humio is not self-hosted or open source, so not really a fair comparison. It also seems targeted towards operational logs, i.e. system logging, traffic logging, auditing. Not things like data pipelines. Kafka and friends can be used for that kind of log, but they are more like databases; they use the term "log" in the sense of sequential and append-only.
Same goes for Splunk, which does have a self-hosted version, but is extremely expensive, last I checked. The SaaS version is also extremely expensive.
The UI is what simplifies analysis and visualization, with live, real-time queries against the database.
MongoDB is a full OLTP document store, so it won't match the write throughput and pub/sub features of these focused systems. RabbitMQ, on the other hand, has performance limits but is meant for complex service-bus-style routing and RPC use cases; I recommend using NATS for that now.
The announcement does not clarify why they use this over Kafka. Is it because Kafka doesn't scale to millions of logs on a single cluster, or because Kafka is not sympathetic to heterogeneous disk arrays containing SSDs and HDDs? I strongly suspect it may be write latency at scale, but this is pure speculation.
I don't know. If I understood why anyone might use this, I'd contribute to building language bindings for the APIs.
- It's designed to work with a large number of logs (roughly equivalent to partitions in Kafka), hundreds of thousands per cluster is common.
- Sequencer failover is very quick, typical failover time when a sequencer node fails is less than a second.
- It supports location awareness and can place data according to replication constraints specified (e.g. replicate it in 3 copies across 2 different regions and 3 racks).
- Because of non-deterministic data placement, it is very resilient to failures in terms of write availability.
- If a node/shard fails, it detects the failure and automatically rebuilds the data that was replicated to the failed nodes/shards.
I am happy to expand more on this point.
We have this concept of a "node set" for a log, which is the set of storage nodes available to receive record copies sent by the sequencer. It typically consists of 20-30 nodes in deployments at Facebook. Write availability is maintained as long as enough storage nodes in the node set are available to accept copies. When storage node failures are detected, the sequencer can simply exclude these nodes from the list of potential recipients for new records. It does not need to update a view that has to be synchronized with readers, which would be a heavy-weight operation. This model preserves high write availability even when many nodes in the node set are unhealthy.
Additionally, this record copy placement flexibility allows the sequencer to quickly route around latency spikes on individual storage nodes, which helps guarantee low append latency.
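To make that concrete, here is a toy sketch of per-record placement (illustrative only; the names and the two-rack rule are made up, not LogDevice's actual code or API):

```python
import random

def pick_copyset(node_set, unhealthy, rack_of, replication_factor=3):
    # Choose, per record, any `replication_factor` healthy nodes from the
    # log's node set, spanning at least two racks. Excluding a failed or
    # slow node is a purely local decision by the sequencer; no shared
    # view has to be synchronized with readers.
    candidates = [n for n in node_set if n not in unhealthy]
    if len(candidates) < replication_factor:
        raise RuntimeError("not enough healthy nodes in the node set")
    for _ in range(100):  # a real placer is smarter than rejection sampling
        copyset = random.sample(candidates, replication_factor)
        if len({rack_of[n] for n in copyset}) >= 2:
            return copyset
    raise RuntimeError("could not satisfy the rack constraint")

# e.g. pick_copyset(range(20), unhealthy={3, 7}, rack_of={n: n % 5 for n in range(20)})
```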
I doubt that's it, since Kafka can certainly do that.
As far as millions of topics, if you have to do it at a logical layer yourself, then you might as well use a system that supports it natively.
Try benchmarking Kafka from 0 partitions to a few thousand partitions in 100-partition increments. The benchmark only needs to write to a single topic, using the provided producer perf tool, while all other topics are inactive with zero data.
As the partitions increase there is a very noticeable drop in throughput that looks to be linear.
Kafka does not handle a large number of partitions well currently, "large" even being in the low thousands. It's easy to hit with just a few hundred topics.
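Something like the following sweep would show it (assuming the stock kafka-topics.sh and kafka-producer-perf-test.sh tools on PATH; hosts, counts, and sizes are made up):

```python
import subprocess

# Grow the cluster by 100 idle partitions per step while measuring
# producer throughput against a single pre-created "active" topic.
for step in range(51):
    if step > 0:
        subprocess.run(
            ["kafka-topics.sh", "--create",
             "--zookeeper", "localhost:2181",
             "--topic", f"idle-{step}",
             "--partitions", "100",
             "--replication-factor", "3"],
            check=True,
        )
    subprocess.run(
        ["kafka-producer-perf-test.sh",
         "--topic", "active",
         "--num-records", "1000000",
         "--record-size", "1024",
         "--throughput", "-1",
         "--producer-props", "bootstrap.servers=localhost:9092"],
        check=True,
    )
```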
Reading between the lines, when LinkedIn and Netflix advertise several clusters, I am predicting/guessing that they shard the data.
Kafka has done well so far, especially in making streaming systems more common, but it's about time for the next-gen systems.
Meanwhile the compute layer becomes very lightweight and almost stateless, which is easy to scale. In LogDevice, the Sequencers are potential bottlenecks, but generating a series of incrementing numbers is about the fastest thing you can do, so it'll outpace any actual data ingest to a single log while giving you a total order of all entries within that log. The numbers (LSNs) follow the Hi/Lo sequence pattern: if a Sequencer fails, another one takes its place with a greater "high" number, which guarantees that all of its LSNs will be greater than the previous Sequencer's. This also provides a built-in buffer: if a Sequencer fails, messages can still be accepted and assigned their permanent LSNs after recovery.
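A toy illustration of that Hi/Lo idea (the bit split is made up, not necessarily LogDevice's actual LSN layout):

```python
EPOCH_BITS = 32  # illustrative split, not necessarily the real layout

def make_lsn(epoch: int, esn: int) -> int:
    # An LSN is a "high" epoch number glued to a "low" per-epoch counter,
    # so ordering across sequencer failovers falls out of integer comparison.
    return (epoch << EPOCH_BITS) | esn

last_from_old_sequencer = make_lsn(epoch=7, esn=123_456)
first_from_new_sequencer = make_lsn(epoch=8, esn=0)  # failover bumps the epoch
assert first_from_new_sequencer > last_from_old_sequencer
```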
Apache Pulsar is similar to LogDevice but goes further: brokers manage connections, routing and message acknowledgements, while data is sent to a separate layer of Apache BookKeeper nodes, which store the data in append-optimized log files.
One question though: will Presto support querying from LogDevice directly? :)
I'm not that familiar with Kafka, but in general LogDevice emphasizes write availability over read availability. There are many applications where data is being generated all the time, and if you don't write it, it will be lost. However, if reading is delayed, it just means readers are a little behind and will need to catch up.
So, when a sequencer node dies and we need to figure out what happened to the records that were in flight -- which ones ended up on disk & can be replicated, what the last record was -- LogDevice still accepts new writes. However, to ensure ordering, these new writes aren't visible to readers until the earlier writes are sorted out.
For the simple case where data placement is location-agnostic, the definition of an f-majority is indeed any n - r + 1 nodes, where n is the nodeset size and r is the replication factor.
However, if your replication property is, say, "place 3 copies across 3 racks", then the definition of f-majority becomes more complicated -- e.g. having all nodes in the nodeset respond minus two racks will also satisfy it.
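For the simple location-agnostic case, the check is literally this (a toy sketch with made-up names; the rack-aware case needs a per-failure-domain check that isn't shown here):

```python
def is_f_majority(responders: int, nodeset_size: int, replication: int) -> bool:
    # A set of nodes is an f-majority when every possible copyset (any
    # `replication` of the `nodeset_size` nodes) must intersect it,
    # i.e. when it contains at least n - r + 1 of the n nodes.
    return responders >= nodeset_size - replication + 1

assert is_f_majority(responders=18, nodeset_size=20, replication=3)      # 20 - 3 + 1 = 18
assert not is_f_majority(responders=17, nodeset_size=20, replication=3)
```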
Which are the cases where consistency is compromised then? If a client of the log needs consistency, it needs to ensure that it has seen all previous updates to a log before making a new update, which implies a read.
> However, if your replication property is, say, "place 3 copies across 3 racks", then the definition of f-majority becomes more complicated -- e.g. having all nodes in the nodeset respond minus two racks will also satisfy it.
Sure, the aim being that once an f-majority has been sealed, no in-flight write can be successfully acknowledged by enough replicas to complete.
Consistency in a more general sense than just read-modify-write consistency. If you have sequencers active in several epochs at the same time accepting writes, the records may end up being written out of order, breaking the total ordering guarantee.
But given that reads are blocked on all sequencers before the current one, this should still provide total order atomic broadcast, unless a single client can connect to a sequencer with a lower epoch than one it has already seen.
However, there can still be reordering in the context of a wider system. E.g. if client A sends a write (w1) to the sequencer in epoch X, which gets replicated and acknowledged, and after that client B sends a write (w2) to the sequencer in epoch X-1, which also gets replicated and acknowledged (because epoch X-1 is not yet sealed), then readers will eventually see w2 before w1. If writes in epoch X weren't accepted before the sealing of epoch X-1 had completed, this reordering would be impossible; however, write availability would suffer as a result.
Anyhow, thanks for answering my questions. Very interesting system.
Scribe isn’t the only place where LogDevice is used though — Facebook has documented using it for TAO as well (as part of the secondary indices)
Unless we're talking about two different projects named Scribe, which is certainly possible.
At any rate, I used “equivalent” here to mean that, while different trade-offs have been made, it has the sort of users building the same sort of applications on the same sort of abstractions — it plays the same role, for all intents and purposes.
It kind of does, on the write pipeline. Tailers vary, but scribed controls the semantics of how your message gets delivered to LogDevice.
And it uses RocksDB under the hood:
Could someone from the LogDevice team give a bit of information about the scale at which they use it at Facebook?
- How many records can this thing ingest per day?
- Are there any limitations on the maximum number of storage nodes?
- What would be your maximum and advised record size for production usage?
- ZooKeeper seems to be the central point used as the epoch provider. Did you encounter any scaling limitations or a maximum number of clients because of that?
Hope that helps.
I like the idea of decoupling compute from storage for streaming/log data.
I wonder if it would be easy to make it run under Consul, instead of ZooKeeper.
> Store up to a million logs on a single cluster.
This sounds pretty confusing / low volume.
So there's no cost to open sourcing. The benefit comes from being known as technically innovative in general, and for recruiting, being known as having interesting, meaty, challenging projects to work on.
The impetus usually comes from team members who want to do the work. It could be to become known for having worked on the project, or a sense of giving back to the community, or a hope that you'll get bug fixes & features from outside contributors. In my (very limited) experience, managers "passively encourage" it -- they generally don't push the team to do it, but when the team asks, they encourage it.
Not true, there's a legal cost associated with making sure something is really ready for public eyes.
The cost of maintaining an open source project is real, but when it is a world-class piece of infrastructure, open sourcing it helps keep it world-class.
Previous discussion on HN: https://news.ycombinator.com/item?id=15142266
Since it doesn't say anything about trustlessness, I assume all nodes are trusted.
All daemons and system administration utilities belong in sbin, because bin is for end-user applications.
Historically, the "s" in sbin meant something else, but it has always contained applications and scripts that only root could run.
When I see these examples, it's depressing to see just how much understanding of UNIX is missing.
That's not the point. The point is that all these generation Y kids grew up on PC buckets and still don't understand UNIX and the concepts behind it, and yet they use it to power their applications. This can only end badly unless they start making an effort to understand the concepts behind the substrate they are writing software for.
Then, here's my two cents. When engaging in conversation and civil dialog, please try to avoid being so dismissive and so proud of yourself and of how much you think you know about stuff. You come across as abrasive and entitled. It's not nice to just jump into a conversation and talk trash about the work of others just because you dislike the operating system that they use.
Finally, if you really care, work on porting it to your operating system of choice and engage in civil conversation doing pull requests, etc. Everybody will be thankful for that.
I recognize the good intention here, but if you're going to post this kind of comment, please try to eliminate the personal provocations. They don't help, and do hurt.
Regardless of your temperament, can you please not be abrasive/aggressive on HN? It encourages worse from others and leads to a toilet-whirlpool effect.
- Cross vendor replication which makes migration much easier.
- No dependency on vendor provided replication protocols.
- Ability to use in-app databases such as RocksDB, SQLite, ...
- Upgrading DB nodes becomes way easier since they are totally separated from each other.