
Facebook open-sources LogDevice, a distributed storage for sequential data - cedricvg
https://logdevice.io/
======
akavel
Can someone from FB chime in with some info on how much storage is needed for
the logs/data? Say, for 1 GB of raw input logs from an http server
(nginx/apache), when stored in LogDevice would they take notably less space on
disk (compression), or more (overhead)? This interests me for evaluating
resources/costs I'd need to prepare if I were to deploy it...

~~~
cedricvg
These numbers really depend on the compressibility of the content, the
compression scheme, and the type of batching used. The metadata overhead is
fairly minimal. LogDevice allows you to configure this at the client,
sequencer, or RocksDB level.
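One way to get a feel for the compression side before deploying: compress a sample of your own access logs and measure the ratio. A minimal sketch using plain zlib as a stand-in for whichever codec you'd actually configure (the log line here is synthetic and repetitive, so real data will compress less well):

```python
import zlib

# Synthetic nginx-style access-log line; substitute a sample of your
# own logs, since real traffic compresses differently.
line = ('203.0.113.7 - - [24/Sep/2018:10:00:00 +0000] '
        '"GET /index.html HTTP/1.1" 200 5316\n')
raw = (line * 10_000).encode()

compressed = zlib.compress(raw, level=6)
ratio = len(raw) / len(compressed)
print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes, "
      f"ratio={ratio:.1f}x")
```

Identical repeated lines compress extremely well; a realistic sample with varying IPs, paths, and timestamps is the only way to get a number you can budget against.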

~~~
akavel
Is some form of compression enabled by default, without having to tweak
options?

~~~
sh00s
No, compression is disabled by default. You can enable compression either on
the storage layer, or by enabling batching and compression on the sequencer,
or by using the buffered write API with compression on the client.

------
AhmedSoliman
Happy to finally see LogDevice open. We have been working on this for years
now.

~~~
the_duke
Can you give an overview over the difference to eg Apache Kafka?

It seems very similar.

~~~
AhmedSoliman
It's a very different architecture and design. You can head to
[https://logdevice.io/docs/Concepts.html](https://logdevice.io/docs/Concepts.html)
to learn more about how LogDevice works.

In terms of function, LogDevice is similar to the core of Apache Kafka.

~~~
majidazimi
True, but Kafka has two very annoying limitations built into it:

\- There is no many-to-many log recovery, whereas -- for example in
Pulsar/DistributedLog -- logs are stored in small segments and distributed to
multiple nodes.

\- Read scalability. Since the whole log is stored on one node (with some
replicas), readers are bound by the sequential read capacity of a single
disk. Again, Pulsar stores logs in segments that are distributed among broker
nodes, which helps a lot when there are many readers.

~~~
LgWoodenBadger
I'm not sure how accurate your comment is regarding Kafka's annoying features
given that Kafka has partitions, which "solve" all of the problems you stated.

~~~
majidazimi
No it doesn't, since a single partition is stored sequentially on one disk,
which limits consumers to the bandwidth of a single disk (say c1 reads the
beginning of the partition and c2 the end). But in the case of Pulsar, c1 is
most probably connected to a different node than c2.
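The bandwidth argument can be put in a toy model: if the whole partition lives on one disk, aggregate read throughput is capped at that one disk no matter how many consumers attach, while segmented placement lets consumers hit different disks. A rough sketch with made-up numbers:

```python
DISK_MB_S = 200  # assumed sequential read throughput of one disk, MB/s

def single_partition_throughput(consumers: int) -> float:
    """All consumers share one disk: aggregate is capped at one disk."""
    return min(consumers * DISK_MB_S, DISK_MB_S)

def segmented_throughput(consumers: int, nodes: int) -> float:
    """Segments spread over `nodes` disks: the cap scales with node
    count (best case, consumers reading disjoint segments)."""
    return min(consumers * DISK_MB_S, nodes * DISK_MB_S)

# c1 reads the head of the log, c2 the tail: on one disk they contend,
# while a segmented layout likely serves them from different nodes.
print(single_partition_throughput(2))   # capped at one disk's speed
print(segmented_throughput(2, 10))      # both consumers at full speed
```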

~~~
adrienconrath
LogDevice has this concept of "node set", which is the set of storage nodes
that can be selected by the sequencer as recipients for a record or a block of
records. A typical node set size is around 20-30 in our deployments. Each
storage node in the node set contains a subset of the records (or blocks of
records) of the log, we call that subset a log strand. The amount of IO
capacity available to append records to a log or read records from a log
scales with the size of the node set.

All of this is done while preserving the total ordering guarantee, thanks to
the separation of sequencing and storage.

The operator could for example set a bigger node set size for logs that are
known to have multiple consumers and require more IO capacity.

At Facebook, we have use cases where a single consumer will need to replay a
backlog of records in a log, sometimes hours or days worth of data to rebuild
its state. We call this a backfill. Node sets allow the IO to be spread across
multiple disks which improves backfill speed and helps reduce hotspots.
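As a toy illustration of why this helps (this is not LogDevice's actual placement algorithm, and the node-set size and replication factor are made up): picking each record's copyset from a larger node set spreads the log's data, and therefore a backfill's reads, across many disks.

```python
import random

NODE_SET = list(range(20))   # e.g. 20 eligible storage nodes for one log
REPLICATION = 3              # copies stored per record

def pick_copyset(rng: random.Random) -> list[int]:
    """Pick a random copyset (set of recipients) from the node set."""
    return rng.sample(NODE_SET, REPLICATION)

rng = random.Random(42)  # seeded for reproducibility
copysets = [pick_copyset(rng) for _ in range(1000)]

# Records end up spread across the whole node set, so a backfilling
# reader pulls from many disks instead of saturating one.
used = {node for cs in copysets for node in cs}
print(f"{len(used)} of {len(NODE_SET)} nodes hold data for this log")
```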

\-- Adrien from the LogDevice team.

------
Rafuino
Very interesting! I hadn't heard of this before but I'd love to see it in
action.

If anyone from the FB team or anyone using LogDevice wants to test performance
with Optane SSDs (and compare to a NAND SSD), make a request by submitting an
issue on our GitHub page:
[https://github.com/AccelerateWithOptane/lab/issues](https://github.com/AccelerateWithOptane/lab/issues).
I'll hook you up with a server hosted by Packet.

------
fullmetaleng
Martin Kleppmann seems to point out technologies for problems of similar
patterns already exist -
[https://twitter.com/martinkl/status/1039938408393662465](https://twitter.com/martinkl/status/1039938408393662465)

~~~
tinco
Those are streaming/pubsub services though, this actually claims to be a
store. I feel that's an important difference.

Do people just point their system journal at Kafka and wait for something to
break?

At my previous job we built something similar to this out of rabbitmq and
mongodb. I always wondered what the other big log companies used. Mongodb
seemed like a pretty good fit, but a pure append only database might be even
better. Trimming performance in MongoDB was subpar so we worked around it by
creating a new collection for each day, trimming became a simple operation of
dropping a collection at the end of each day.

~~~
EdwardDiego
> Those are streaming/pubsub services though, this actually claims to be a
> store. I feel that's an important difference.

> Do people just point their system journal at Kafka and wait for something to
> break?

Kafka can be used as a data store if you like, so long as you're happy with
the data management and access patterns it gives you - it is, after all,
optimised for large sequential reads.

LogDevice looks to be very similar to Kafka for most use cases; hell, they
even use RocksDB, which is used for stateful operations in Kafka Streams, and
of course, Zookeeper.

Where it differs is that it looks like it was designed for you to be able to
work against a single "cluster" that could well be running across multiple
data-centres. Which is very much a Facebook problem to solve.

So yeah, Kafka was a distributed log built for LinkedIn size problems,
LogDevice is a distributed log built for Facebook sized problems.

Most of us don't have Facebook sized problems.

~~~
Serow225
What's a good distributed log for 10-dev sized companies? :)

~~~
simonrobb
OKLog, Humio, and Splunk are all worth checking out.

~~~
atombender
OKLog has been abandoned by the author (the project is now read only on
GitHub).

Humio is not self-hosted or open source, so not really a fair comparison. It
also seems targeted towards _operational_ logs, i.e. system logging, traffic
logging, auditing. Not things like data pipelines. Kafka and friends can be
used for that kind of log, but they are more like databases; they use the term
"log" in the sense of sequential and append-only.

Same goes for Splunk, which does have a self-hosted version, but is extremely
expensive, last I checked. The SaaS version is also extremely expensive.

~~~
geeters1
Humio does have a self-hosted version - but it is closed source. You can
download a trial, and it supports most log types via open-source log
shippers, i.e. Logstash and Beats, along with other popular formats including
Kafka.

[https://docs.humio.com/integrations/](https://docs.humio.com/integrations/)

The UI is what simplifies analysis and visualization, with live, real-time
queries against the database.

------
mmcclellan
I had just stumbled across [https://github.com/facebookincubator/python-
nubia](https://github.com/facebookincubator/python-nubia) and am anxious to
try it out. Was wondering about the internal project it was factored out from.
This appears to be it.

~~~
AhmedSoliman
Correct. LDShell in LogDevice was the starting point of python-nubia.

------
thinkersilver
The use cases overlap neatly with Kafka's. Everything from its usage of
ZooKeeper to its time- and storage-based retention tuning is similar.

The announcement does not clarify the reason they use this over Kafka. Is it
because Kafka doesn't scale to millions of logs on a single cluster, or is it
because Kafka is not sympathetic to heterogeneous disk arrays containing SSDs
and HDDs? I strongly suspect it may be write latency at scale, but this is
pure speculation.

I don't know. If I understood why anyone might use this, I'd contribute to
building language bindings for the APIs.

~~~
otterley
> Is it because Kafka doesn't scale to millions of logs on a single cluster

I doubt that's it, since Kafka can certainly do that.

~~~
manigandham
Millions of separate topics on a single Kafka cluster? The way it's designed
requires opening files for all of those topics and their partitions so good
luck if you're trying that. You'll run out of file handles, then memory, and
then the disk access will completely freeze up.

~~~
otterley
I didn't think we were speaking of millions of topics here; only millions of
logs. You can certainly have logs numbering in the millions using a single
topic. Mux/demux would have to happen at the producer/consumer side, of
course.

~~~
manigandham
Do you mean log segments then? In that case I don't see what's special about
it because that's just rolling files and all of these systems can handle
millions that way.

As far as millions of topics, if you have to do it at a logical layer
yourself, then you might as well use a system that supports it natively.

------
manigandham
Great to see this released. There are some similar architecture decisions to
Apache Pulsar as well, with the separation of compute (in this case the
sequencer) from the storage.

Kafka has done well so far, especially in making streaming systems more
common, but it's about time for the next-gen systems.

~~~
ashu
How does LogDevice differ from Kafka?

~~~
manigandham
Kafka brokers handle both the computation (partition/topic management,
sequencing, assignments, etc) and storage together. This coupling creates
scaling and operational challenges which LogDevice removes by separating the
layers. Storage nodes can be as simple as object stores (but optimized for
appending files) and use multiple non-deterministic locations for a given
piece of data to randomize placement. They read, write and recover data very
quickly by working together in a mesh.

Meanwhile the compute layer becomes very lightweight and almost stateless,
which is easy to scale. In LogDevice, the sequencers are potential
bottlenecks, but generating a series of incrementing numbers is about the
fastest thing you can do, so it will outpace any actual data ingest to a
single log while giving you a total order of all entries within that log. The
numbers (LSNs) follow the Hi/Lo sequence pattern: if a sequencer fails,
another one takes its place with a greater "high" number, so all of its LSNs
are guaranteed to be greater than the previous sequencer's. This also
provides a built-in buffer: in case a sequencer fails, messages can still be
accepted and assigned their permanent LSNs after recovery.
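The Hi/Lo failover described above can be sketched in a few lines. This is a simplified model (roughly, the "high" part corresponds to an epoch and the "low" part to an offset within it), not LogDevice's actual implementation:

```python
class Sequencer:
    """Toy Hi/Lo sequencer: LSN = (epoch << 32) | offset."""

    def __init__(self, epoch: int):
        self.epoch = epoch
        self.offset = 0

    def next_lsn(self) -> int:
        self.offset += 1
        return (self.epoch << 32) | self.offset

# Normal operation under epoch 1.
s1 = Sequencer(epoch=1)
a = s1.next_lsn()
b = s1.next_lsn()

# s1 fails; a replacement activates with a strictly greater epoch, so
# every LSN it issues compares greater than anything s1 ever issued,
# even though it doesn't know s1's last offset.
s2 = Sequencer(epoch=2)
c = s2.next_lsn()

assert a < b < c
print(hex(a), hex(b), hex(c))
```

The key property is that no coordination with the failed sequencer is needed: bumping the epoch alone preserves the total order.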

Apache Pulsar is similar to LogDevice but goes further where brokers manage
connections, routing and message acknowledgements while data is sent to a
separate layer of Apache Bookkeeper nodes which store the data in append-
optimized log files.

~~~
Rapzid
Interesting. Microsoft's Tango paper had some interesting things to say about
sequencers/sequences as well.

------
posnet
Awesome, I have been waiting for this since seeing the @scale talk about it.
[https://atscaleconference.com/videos/logdevice-a-file-
struct...](https://atscaleconference.com/videos/logdevice-a-file-structured-
log-system/)

~~~
Rafuino
Is there supposed to be a replay of that talk on the site you link to or is it
just not loading for me?

~~~
manigandham
The event is hosted by Facebook and there is an embedded Facebook video player
on that page. Here's the direct link:
[https://www.facebook.com/atscaleevents/videos/19602876909109...](https://www.facebook.com/atscaleevents/videos/1960287690910992/)

~~~
Rafuino
Thanks! My script and adblockers must've broken the site.

------
StreamBright
The number of great quality open source projects from Facebook just keeps
growing. I really like the consistency guarantees:

[https://logdevice.io/docs/Concepts.html#consistency-
guarante...](https://logdevice.io/docs/Concepts.html#consistency-guarantees)

And it uses RocksDB under the hood:

[https://logdevice.io/docs/Concepts.html#logsdb-the-local-
log...](https://logdevice.io/docs/Concepts.html#logsdb-the-local-log-store)

------
adev_
Thanks for open-sourcing this, it looks like a great project.

Could someone from LogDevice give a bit of information about the scale at
which they use it at Facebook?

\- How many records can this thing ingest per day?

\- Any limitations on the maximum number of storage nodes?

\- What would be your maximum and advised record size for production usage?

\- ZooKeeper seems to be the central point used as the epoch provider. Did
you encounter any scaling limitations or maximum number of clients due to
that?

~~~
AhmedSoliman
I cannot give you exact numbers, but here is some information that might be
useful:

\- LogDevice ingests over 1TB/s of uncompressed data at Facebook. This was
already highlighted in last year's talk at the @Scale conference.

\- The default limit in the code for the number of storage nodes in a cluster
is 512. However, you can use --max-nodes to change that. There is no
theoretical limit there. Each LogDevice storage daemon can handle multiple
physical disks (we call them shards). So, if you have 15 disks per box and
512 servers, that's 7680 total disks in a single cluster.

\- The maximum record size is 32MB. However, in practice, payloads are
usually much smaller.

\- Zookeeper is not (currently) a scaling limitation, as we don't connect to
Zookeeper from clients (as long as you are sourcing the config file from the
filesystem and not using Zookeeper for that as well).

Hope that helps.

------
sandstrom
Very interesting!

I like the idea of decoupling compute from storage for streaming/log data.

I wonder if it would be easy to make it run under Consul, instead of
ZooKeeper.

~~~
AhmedSoliman
We use Zookeeper primarily for the EpochStore. This is the abstraction you
can use if you want to replace Zookeeper. It shouldn't be that hard, as long
as Consul offers the same guarantees as Zookeeper.
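As a sketch of the guarantee such a replacement would need: the epoch store must hand out strictly increasing epochs per log via an atomic read-modify-write, which Consul's KV Check-And-Set (on ModifyIndex) should be able to supply. The interface and names below are hypothetical, with a single-process in-memory stand-in:

```python
from abc import ABC, abstractmethod
import threading

class EpochStore(ABC):
    """Hypothetical abstraction: hand out strictly increasing epochs.

    A Zookeeper backend would use a versioned znode; a Consul backend
    could use KV Check-And-Set against the key's ModifyIndex."""

    @abstractmethod
    def next_epoch(self, log_id: int) -> int: ...

class InMemoryEpochStore(EpochStore):
    """Single-process stand-in; correct only within one process."""

    def __init__(self):
        self._epochs: dict[int, int] = {}
        self._lock = threading.Lock()

    def next_epoch(self, log_id: int) -> int:
        with self._lock:  # models the store's atomic compare-and-swap
            epoch = self._epochs.get(log_id, 0) + 1
            self._epochs[log_id] = epoch
            return epoch

store = InMemoryEpochStore()
print(store.next_epoch(1), store.next_epoch(1), store.next_epoch(2))
```

Whatever backs it, two sequencer activations for the same log must never receive the same epoch, which is why a plain counter without CAS semantics wouldn't do.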

------
remh
Am I the only one puzzled by

 _Scalable_

 _Store up to a million logs on a single cluster._ ?

This sounds pretty confusing / low volume.

~~~
manigandham
logs = topics, so they mean 1M separate topics on a single cluster.

~~~
remh
Makes more sense. Thank you!

------
tryptophan
What benefit to facebook is there from open sourcing technology they have
developed?

~~~
martincmartin
Facebook's competitive advantage doesn't come from having the best reliable
streaming data store at scale, or from its software in general. Even if
MySpace, Friendster, or Google+ got their hands on the whole software stack &
started running it, people would stick with Facebook.

So there's no cost to open sourcing. The benefit comes from being known as
technically innovative in general, and for recruiting, being known as having
interesting, meaty, challenging projects to work on.

The impetus usually comes from team members who want to do the work. It could
be to become known for having worked on the project, or a sense of giving back
to the community, or a hope that you'll get bug fixes & features from outside
contributors. In my (very limited) experience, managers "passively encourage"
it -- they generally don't push the team to do it, but when the team asks,
they encourage it.

~~~
thrusong
If that's true, why haven't they open sourced Haystack? Clearly they're
holding onto it due to competitive advantage.

~~~
martincmartin
I don't know, but my guess is nobody associated with it wants to put their
other work on hold to make it happen. From my limited experience, nobody
pushes you, and nobody blocks you. So it depends a lot on the motivations of
the engineers on the project.

------
javiermaestro
Awesome to see this finally happening :)

Previous discussion in HN:
[https://news.ycombinator.com/item?id=15142266](https://news.ycombinator.com/item?id=15142266)

------
jMyles
I don't see anything about trust requirements or verification. Does LogDevice
assume that all devices in my cluster are trusted?

~~~
cedricvg
LogDevice uses SSL for authentication. This can be enabled for both clients
and servers [1].

[1]
[https://logdevice.io/docs/Settings.html#security](https://logdevice.io/docs/Settings.html#security)

~~~
jMyles
That's not what I mean though. What if I have a cluster with devices I don't
trust, but I want to let them emit logs if they conform to a particular
protocol. Like, will this thing check signatures for me and such?

Since it doesn't say anything about trustlessness, I assume that it assumes
that all nodes are trusted.

~~~
lexs
LogDevice is crash fault tolerant, not Byzantine fault tolerant, if that's
what you're asking. This fault tolerance concerns where the logs are placed,
not who's emitting them, though. If you want to analyse logs for
inconsistencies or attack patterns, you should look into something like
SEAMS/REAMS; that's completely out of scope for LogDevice.

------
Annatar
"bin/logdeviced"

All daemons and system administration utilities belong in sbin, because bin
is for end-user applications.

Historically, the "s" in sbin meant something else, but it always contained
applications and scripts only root could run.

When I see these examples, it's depressing to see just how much understanding
of UNIX is missing.

~~~
AhmedSoliman
Maybe sending a PR would help?

~~~
Annatar
It's for Linux only, and I run illumos-based SmartOS on my own infrastructure.

That's not the point. The point is that all these generation Y kids grew up on
PC buckets and still don't understand UNIX and the concepts behind it, and yet
they use it to power their applications. This can only end badly unless they
start making an effort to understand the concepts behind the substrate they
are writing software for.

~~~
javiermaestro
A few things. First, can't you run Linux inside a Solaris Zone? I don't know
much about Solaris stuff (although I do like it very much, I grew up mostly
with Linux, which you so much despise, and I'm not too familiar with other
Unixes). So... I think you could probably run Logdevice if you really wanted.

Then, here's my two cents. When engaging in conversation and civil dialog,
please try to avoid being so dismissive and so proud of yourself and of how
much you think you know about stuff. You come across as abrasive and entitled.
It's not nice to just jump into a conversation and talk trash about the work
of others just because you dislike the operating system that they use.

Finally, if you really care, work on porting it to your operating system of
choice and engage in civil conversation doing pull requests, etc. Everybody
will be thankful for that.

~~~
dang
> You come across as abrasive and entitled.

I recognize the good intention here, but if you're going to post this kind of
comment, please try to eliminate the personal provocations. They don't help,
and do hurt.

------
majidazimi
External logging service is my favorite way of doing replication. It provides
nice features. Specifically:

\- Cross vendor replication which makes migration much easier.

\- No dependency on vendor provided replication protocols.

\- Ability to use in-app databases such as RocksDB, SQLite, ...

\- Upgrading DB nodes becomes way easier since they are totally separated from
each other.

------
cardosof
How does this fit in an ML training pipeline? (this is mentioned on the page)

~~~
manigandham
It's just streaming data but more scalable and with total ordering which can
be important for ML.

------
senderista
Sounds like it might have been influenced by the MSR CORFU project (separate
sequencer, write striping). Can anyone confirm?

~~~
noahdesu
It's hard to deny that there is at least some influence there. Like
LogDevice, the zlog project [0] is influenced by CORFU (separate sequencer,
write striping), but both use different storage interfaces / strategies.

[0]: [https://github.com/cruzdb/zlog](https://github.com/cruzdb/zlog)

------
pedrorijo91
Is there any comparison with other similar storages?

------
SkyRocknRoll
This is a lot more similar to Apache BookKeeper.

------
silur
this is like....the harder half of a whole blockchain project :D super
interesting

------
polskibus
Is this a Kafka competitor?

