
Twitter open-sources a high-performance replicated log service - spullara
https://github.com/twitter/distributedlog
======
boredandroid
Several people asked how this compares to Kafka (I'm one of the people who
created Kafka at LinkedIn). Here's my take:

I think the motivations they list are:

1\. Different I/O model

2\. Was started before Kafka had replication (the first release of Kafka with
replication was in late 2013, I think)

The I/O model I'm less sure about; we looked at similar things for Kafka and
they didn't seem worth it (basically you're doing a ton of stuff at the app
level that the OS does pretty well--namely caching and buffering linear I/O).
We'd have to look at actual benchmarks to know.

Here is my take on the pros and cons of the core tech.

Pros:

\- Seems to have better built-in support for fencing/idempotence

\- Better geo placement?

Cons:

\- Lots more moving pieces. Already people are irritated that there are both
Kafka nodes and ZK to set up. This system seems to split this over separate
physical tiers for serving, core, storage, and ZooKeeper. My experience has
been that lots of tiers is generally a big headache.

Neutral:

\- There seems to be built-in archival to HDFS. I think if the consumer is
fast and efficient then you don't need to reach around your consumer API,
which will be high latency (since you have to wait for files to be closed
out).

There is also a bunch of stuff Kafka does that I'm just not sure about how
complete it is in DistributedLog:

\- Clients in a bunch of languages

\- Integration with all the major stream processing frameworks

\- Log compaction:
[http://kafka.apache.org/documentation.html#compaction](http://kafka.apache.org/documentation.html#compaction)

\- Connector management: [http://www.confluent.io/blog/announcing-kafka-
connect-buildi...](http://www.confluent.io/blog/announcing-kafka-connect-
building-large-scale-low-latency-data-pipelines)

\- Quotas/throttling

\- Security/ACLs
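For anyone unfamiliar with log compaction: the idea is to retain only the most
recent record per key, with a null value acting as a delete "tombstone". A
minimal Python sketch of the semantics (illustrative only, not Kafka's actual
implementation):

```python
def compact(log):
    """Return the compacted form of a log of (key, value) records.

    Keeps only the most recent record per key, in log order; a value
    of None is a tombstone that deletes the key entirely.
    """
    last = {}  # key -> index of that key's most recent record
    for i, (key, _) in enumerate(log):
        last[key] = i
    return [(key, value)
            for i, (key, value) in enumerate(log)
            if last[key] == i and value is not None]
```

A consumer replaying the compacted log ends up with the same final key/value
state as one that replayed the full log.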

------
duggan
People talk about Zookeeper as a negative, but in a quorum it's been one of
the most stable and reliable pieces of software I've deployed, despite being a
bit frustrating to set up / configure. Netflix's Exhibitor[1] is an
indispensable addition to it.

Also, once you're on the Big Data™ train, a lot of things like to plug into
Zookeeper, so it becomes more of a convenience.

Kafka, and presumably DL, are at their most useful when you're pushing the
limits of NIC and/or HDD performance for throughput. Zookeeper's configuration
is a footnote in the complexity of managing one of these systems, and lets
them avoid implementing their own byzantine coordination system. Also, folks
seem to appreciate Aphyr's opinion, and he states it pretty plainly: _Use
Zookeeper. It’s mature, well-designed, and battle-tested._ [2]

[1]
[https://github.com/Netflix/exhibitor](https://github.com/Netflix/exhibitor)

[2] [https://aphyr.com/posts/291-jepsen-
zookeeper](https://aphyr.com/posts/291-jepsen-zookeeper)

~~~
sambe
I found Zookeeper to be pretty easy to learn about and deploy. I also found it
got a semi-bad reputation with others because they did not want to learn
anything. Two major examples being:

* attempting thousands (tens of thousands?) of simultaneous writes across data centres/continents at the same instant and then saying it was slow, and subscribing all clients to updates on all parts of the tree. The architecture had been grown by people who didn't understand the guarantees and constraints.

* calling it unreliable after arbitrarily moving nodes around without changing connect strings and generally mis-configuring it. Essentially, pointing clients at machines that no longer contained nodes and blaming ZK for this not working.

It could be easier to set up. People often don't want to think about such
things at all. But it's also not the hardest, and I found it to be very
resilient to node and network failures when deployed correctly.

~~~
duggan
Agreed, some of the problems I encountered were because I did not carefully
read the documentation. Some are because a lot of tooling has become easier to
deploy in the years since Zookeeper was released, so the bar has been raised.

Which is just to say there's room for improvement. A dependency on Zookeeper
is fine if you've already got a configured cluster, and a cognitive speed bump
if not.

------
tcoppi
This could be an interesting competitor to Apache Kafka, which is pretty much
unique in this space as far as I'm aware.

On another note, I find it somewhat funny that these are called "log"
services; logging is probably the least interesting use case for these things
I can think of. A better description in my mind would be a distributed
event processing framework, since what they are really doing is distributing
discrete events in a reliable manner.

~~~
goejoegg
If you read the CS papers surrounding distributed systems, you will often see
the notion of a 'journal' or a 'log', meaning an append-only structure, which
typically contains numerous agreed-upon facts.

~~~
tcoppi
I agree that the term describes the foundational data structure these services
provide, but in common parlance that is not really what it means at all, and it
is confounded by a common (and boring) use case: synchronizing processing
of actual, text-based logs.

~~~
sheriff
Jay Kreps, architect of Kafka, calls it a log.

[https://engineering.linkedin.com/distributed-systems/log-
wha...](https://engineering.linkedin.com/distributed-systems/log-what-every-
software-engineer-should-know-about-real-time-datas-unifying)

------
heavenlyhash
This looks potentially fantastic. If I could beg one wish from the developers
of this (and almost every other project anywhere near this space), though, it
would be one tiny piece of documentation:

 _What's your unique ID scheme?_

Let's say I'm willing to believe[1] that you've got Durable and Consistent
down once messages make it committed _into_ the system. What's the story for
messages on their way in? My application logs are buffered to the local disk,
now I'm streaming them into central storage, and halfway through a TCP
connection that's shuffled 2 MB of thousands of messages into storage, the
connection terminates -- unexpectedly, mid-message. Could the service have
committed _more_ messages than it acknowledged? Or many _fewer_ than I've sent?
Both could be true from the network standpoint.[2]

So, what I need to know, and what should be _very_ easy to answer, front-and-
center in your docs, pretty please:

1) Where should my log uploader resume?

2) Is there any danger of repeatedly entering some lines?

3) If I have log lines that are legitimately duplicates, will they be stored
at the correct count?

These are questions that may have a different answer than the durability
_after_ data makes it fully into the system. It also may provide useful
information about how complex the code in a submitting client has to be,
because good answers tend to require some kind of ID sequence being assigned
on submitting clients, afaict. And it's really just plain critical to sanity.
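For what it's worth, the answers that tend to hold up require exactly that:
client-assigned sequence numbers plus server-side dedup. A minimal sketch of
the pattern (all names hypothetical; this is the generic technique, not any
particular system's protocol):

```python
class IdempotentReceiver:
    """Server-side dedup driven by client-assigned sequence numbers.

    Each producer stamps records with a monotonically increasing seq;
    a record is committed only if its seq is exactly one past the last
    committed seq for that producer, so retries after an ambiguous
    failure are safe and a producer can always ask where to resume.
    """

    def __init__(self):
        self.last_seq = {}  # producer_id -> highest committed seq

    def append(self, producer_id, seq, record, log):
        expected = self.last_seq.get(producer_id, -1) + 1
        if seq < expected:
            return "duplicate"  # already committed; ack again, don't re-store
        if seq > expected:
            return "gap"        # producer must resend starting from `expected`
        log.append(record)
        self.last_seq[producer_id] = seq
        return "committed"

    def resume_point(self, producer_id):
        """Answer to question 1: resume at the next uncommitted seq."""
        return self.last_seq.get(producer_id, -1) + 1
```

With this scheme: resume at `resume_point()` (question 1), duplicated sends are
suppressed by seq (question 2), and legitimately identical lines get distinct
seqs so they are stored at the correct count (question 3).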

----

[1] well, no, I'm not, "trust but verify" in all things etc etc; but let's
suppose that's _more_ believable and something I have to mechanically verify
anyway, and doesn't have an obviously observable boolean at the protocol level
as to whether it's going to work well or not, and system _internals_ simply
don't have such a sordid history of being over-simplified until they're broken
like client interfaces so often are, so...! We'll handwave that to a later and
more involved step of quality investigation.

[2]
[https://en.wikipedia.org/wiki/Two_Generals'_Problem](https://en.wikipedia.org/wiki/Two_Generals'_Problem)

~~~
chillacy
I stumbled upon this page while reading the docs:
[https://twitter.github.io/distributedlog/html/design/main.ht...](https://twitter.github.io/distributedlog/html/design/main.html#consistency)

It looks like they use fencing and a two-phase commit to prevent duplicate
writes. Whether that covers all failure scenarios I'm not sure.
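For context, the general shape of fencing (a generic sketch with hypothetical
names, not DistributedLog's actual protocol): each new writer takes ownership
by bumping an epoch, and storage rejects any write stamped with an older epoch,
so a stale writer that wakes up after a failover cannot corrupt the log.

```python
class FencedStore:
    """Storage that accepts writes only from the newest writer epoch."""

    def __init__(self):
        self.epoch = 0
        self.records = []

    def fence(self):
        """A new writer takes over; all older epochs are now rejected."""
        self.epoch += 1
        return self.epoch

    def write(self, writer_epoch, record):
        if writer_epoch < self.epoch:
            raise PermissionError("fenced: stale writer epoch")
        self.records.append(record)
```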

------
ryanobjc
I was interested until I saw the Zookeeper dependency.

I have had too many deployment nightmares with Zookeeper. I would prefer to
avoid it as much as possible, plus systems software in Java, sigh.

~~~
jlisam13
'plus systems software in Java', mind sharing an explanation?

~~~
ryanobjc
Java at first is great. Once you get into large and complex loads, the GC
tuning becomes load-specific. While tuning GC can be fun (for the first dozen
times), the bigger problem happens when your load shifts and your GC tuning
becomes subtly wrong.

Also even IF you are tuned properly, there is a world of 99-percentile you'll
never get to.

Additionally, the difference between OpenJDK and Oracle JDK becomes something
admins learn to hate you for. Complex shell scripts to invoke Java, and lots
of standard Unix tools just don't work super well, e.g. pgrep and pkill. You
can tweak it, but it takes a little while to learn the many tips and tricks.

It's not all bad, most Java programs are deployed in a static-all-batteries-
included fashion, so you rarely worry about system-installed library versions.
So that makes deployment a little less hassle. You never have to recompile for
cross platform. The profiling tooling and other stuff is pretty good, and the
more you're willing to pay the better the tools get.

~~~
coredog64
I find that 'jps' works as a good replacement for pgrep.

~~~
ryanobjc
Sure, but jps isn't pkill.

The point is you can't use the binary name to make your way around anymore. The
binary name is 'java'. Standard Unix tools just don't work. Yes, there are
workarounds, but over time things end up being just a little more complex than
they should be.

Which means if someone is comparing ZooKeeper and etcd, well, etcd wins major
points for being unixy and easy to deploy. Copy 1 binary, done. ZK loses major
points here. Gotta make sure the JVM is installed, but do you need the OpenJDK
or the Oracle one? If the latter, well, apt-get and yum are less helpful.

It's all just little globs of annoying details that add up to be a small pain.
Nothing horrible, but if you could make a choice to avoid that, why not?

Basically I guess what I'm saying is there is probably a market for replacing
all the Javay distsys stuff with Rust/Go versions. I mean look at etcd!

~~~
lmm
> Basically I guess what I'm saying is there is probably a market for
> replacing all the Javay distsys stuff with Rust/Go versions. I mean look at
> etcd!

Or in the other direction. If your main system runs on the JVM and your
sysadmins are used to the JVM tools then having a piece of infrastructure
that's just another .jar is wonderful, and C/Rust/Go/Ruby/etc. infrastructure
elicits groans. Mixing platforms will always be harder than a common platform.
So the infrastructure market depends on where you think the future of
applications is.

------
crgwbr
Anyone know how this is similar to and different from Kafka? Or why Twitter
decided to build their own instead of using and contributing to Kafka?

~~~
murph
There's a great blog post on this: [https://blog.twitter.com/2015/building-
distributedlog-twitte...](https://blog.twitter.com/2015/building-
distributedlog-twitter-s-high-performance-replicated-log-service)

~~~
DannyBee
Which says: "At design time we had concerns about Kafka's I/O model and its
lack of strong durability guarantees -- a non-starter for an application like
a distributed transaction log[3]"

Seems reasonable, right?

Except "[3] Kafka addressed these durability concerns in version 0.8"

So they built a whole thing, because they didn't bother to ask or say "hey, if
we help fix the durability, would that be welcome?" or even "do you guys have
a plan and timeline to fix it?"

That's .... not great.

Now, maybe there are other reasons, but they aren't elucidated in this blog
post :P.

Even then, my general view would be "did you approach the community and
discuss your concerns or just dismiss them out of hand as infeasible", mainly
because my experience is that if you do this, you often find they have exactly
the same set of concerns/goals, and just need more resources to make it
happen.

The desire to build shiny objects is very large. Outside of the paper plans of
engineering teams, these things rarely end up more shiny than what already
exists or will be built by the time you are done.

~~~
d0vs
Seriously this seems like a common pattern in open source: [big company] could
just improve [X] but instead builds something from the ground up.

~~~
vosper
Sometimes you need to let talented engineers build things from the ground up,
because it's good for them, makes them happy, and stops them from going to
work somewhere else. Keeping people with the skills to solve these types of
problems around and happy is also great for recruiting and for helping your
less capable engineers learn and grow.

~~~
jbooth
Sure, but duplicating Kafka? How many man-months did they put into building
this, proving it out, and dealing with fallout from any bugs or production
issues?

What's the point of retaining an engineer who's doing nothing for the business
but re-inventing existing successful software?

------
DenisM
Speaking of logs, I want to put some logging in place for my web server. I log
every single request with extensive details, so I can debug things later if
needed. It's several gigabytes per day now, so I can no longer just dump it on
disk as I did for the last couple of years.

Since I'm on AWS EC2, I want to try this:

    
    
      - Write the logs to local SSD, asynchronously
        so as not to hold back the http request.
      - Have a separate cron job that loops through
        the log directory and scoops up all the files.
      - The job will then stuff those files into a Kinesis Firehose.
        AFAIK, Kinesis Firehose does not require any capacity provisioning,
        unlike Kinesis Streams, so I'm set "for life" (up to 5MB/second).
      - The firehose will accumulate the logs and put them into S3.
        Hurray, unlimited storage!
      - S3 will trigger a Lambda.
      - Lambda will parse through the log from S3, pull out
        interesting properties (IP address, user id, session id, etc.) and
        stuff them into a DynamoDB table.
      - If I need to see data from one user/ip/session I will use DynamoDB
        to find the right S3 blobs.
      - If I need to reprocess the logs to extract a new piece
        of data that I did not foresee earlier, I can run a
        map-reduce task.
    

Except for the last piece, this looks like something I can half-ass in a couple
of days and forget about for another couple of years.

Any opinions? I don't really want to use a SaaS log service because gigabytes
per day.
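The scoop-and-ship cron job is the piece most worth sketching; here's roughly
what it could look like, with a hypothetical `upload` callable standing in for
the Kinesis Firehose put (the batch bound matters because you don't want tiny
or oversized puts):

```python
import os

def ship_logs(log_dir, upload, max_batch_bytes=4 * 1024 * 1024):
    """Scoop rotated log files from log_dir, ship them in bounded
    batches via `upload`, and delete each file once it's shipped.
    Only files no longer being written to should live in log_dir.
    """
    batch, paths, size = [], [], 0

    def flush():
        nonlocal batch, paths, size
        if batch:
            upload(b"".join(batch))  # one Firehose/S3 put per batch
            for p in paths:
                os.remove(p)         # safe to delete only after shipping
        batch, paths, size = [], [], 0

    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        with open(path, "rb") as f:
            data = f.read()
        if size + len(data) > max_batch_bytes:
            flush()
        batch.append(data)
        paths.append(path)
        size += len(data)
    flush()
```

Deleting only after the upload returns is what keeps an instance crash from
losing anything that hasn't been acknowledged.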

~~~
d23
I've heard some bad things about Kinesis in general. Why not have your cron
job just put the logs onto S3 directly?

I've used Lambda a bit. The debugging process can be a pain, since you're
forced to upload a ZIP file, and if your code times out Lambda doesn't give
you any traceback to indicate what happened. There's also a maximum run time
for each Lambda invocation, which I believe is 5 minutes. Is there a chance
your parsing may run longer than that? Also, what will you do if you upload
some bad code and the parsing fails? Will it be the end of the world if you
lose data while you fix the parsing?

Oh, I see you plan on doing map-reduce to re-parse the logs, so maybe that
part isn't as big a deal.

You could also consider doing something like rsyslog -> db-of-choice while
also rotating the files off to S3 for long-term map-reduces. This is all to
ignore the obvious ELK cluster solution, which will give you good data
visualization and investigation options, but may be more of a headache to set
up and maintain than you are looking for.

Anyway, those are my thoughts. Hope they help.

~~~
DenisM
Thanks for sharing your thoughts, esp on Lambda.

For Kinesis, I planned to use Firehose, not Streams (the latter have to be
provisioned, which I was hoping to avoid). The firehose could put data into S3
for me, and S3 would trigger lambda. However I just realized that S3 will only
make 3 attempts to invoke the Lambda, so that pretty much rules out this part
of my design - the data will not get lost, but it will not get indexed either.
I may run map/reduce later, but I don't want to be dependent on doing that to
pick up the loose ends.

These are just server logs, they don't affect business continuity. Still I
wouldn't want to be sitting there and wondering "is this user having
connectivity problems, or did I just lose a pile of logs?".

I could probably put the files directly into S3; I got carried away stacking my
AWS features together. :) I'd need to be more careful with batching, so as not
to create batches too small or too large. Perfectly doable, though Kinesis
Firehose already does that for me. Plus, in case of EC2 instance death I will
lose the current batch with the hand-rolled solution, but not with Firehose.

So I guess I should just put a batch of data directly to S3 and send an SQS
message to make sure that it's indexed properly, then delete the local files.
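That S3-plus-SQS step can be sketched as a standard at-least-once loop; the
deque here is a stand-in for SQS, and `index_batch` is a hypothetical callable
for the parse-and-write-to-DynamoDB step (which must be idempotent, since
retries can re-deliver a message):

```python
import collections

def drain_queue(queue, index_batch, max_attempts=5):
    """Process messages naming S3 batches to index; on failure the
    message is re-queued instead of dropped, so nothing silently
    goes unindexed while bad parsing code is being fixed."""
    attempts = collections.Counter()
    while queue:
        key = queue.popleft()
        try:
            index_batch(key)
        except Exception:
            attempts[key] += 1
            if attempts[key] < max_attempts:
                queue.append(key)  # retry later
            # else: dead-letter / alert, but the S3 data itself is safe
```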

I really don't want to pick up another thing that I have to understand and
manage. Like ELK. Someone who knows ELK will probably have no problem managing
it, but my head is full with business domain problems.

------
tlrobinson
How does this compare to Kafka?

~~~
scaleout1
Here is a PDF going briefly into the Kafka comparison and design motivation:
[http://goo.gl/J9XdsG](http://goo.gl/J9XdsG). During my time at Twitter I
remember when Twitter switched from Kafka to DistributedLog. They have an
internal layer that adapts the Kafka API to DistributedLog; I am not sure if
they have open sourced that.

~~~
gshx
Very nice, now that we have some good competition in this space: Kafka, DL,
and Wormhole. If only FB would open source Wormhole, there would be some good
related systems software to choose from and contribute to.

------
moondowner
By checking out the code quickly it looks like it's built on top of Apache
Bookkeeper [http://bookkeeper.apache.org/](http://bookkeeper.apache.org/) ?

~~~
tlrobinson
More info on the stack:
[https://twitter.github.io/distributedlog/html/architecture/m...](https://twitter.github.io/distributedlog/html/architecture/main.html#software-
stack)

~~~
moondowner
Nice docs, thanks!

------
vruiz
For those who like to watch talks, I found one on youtube:
[https://www.youtube.com/watch?v=QW1OEQxcjZc](https://www.youtube.com/watch?v=QW1OEQxcjZc)

------
stemuk
I am not that into this topic, but is DL comparable to SoundCloud's Roshi[0],
and if not, what's different? Thanks!

[0] [https://github.com/soundcloud/roshi](https://github.com/soundcloud/roshi)

------
hvmonk
I wish they had kept some of the commit history - would be tough to parse the
project without it.

------
trungonnews
Does DL have a limit on the number of partitions like Kafka?

~~~
mgodave
A DL stream is a single partition if you are thinking in terms of Kafka. We
stitch them together as a partitioned stream a la Kafka using a different
system. So, to answer your question: no, there is no limit to the number of DL
streams you can put together into a bigger logical stream; we have some very
large ones.

~~~
alexatkeplar
Ah! That actually explains the strange syntax I was seeing in the tutorials:

    
    
      // Create Stream `basic-stream-{3-7}`
      // dlog tool create -u ${distributedlog-uri} -r ${stream-prefix} -e ${stream-regex}
      ./distributedlog-core/bin/dlog tool create -u distributedlog://127.0.0.1:7000/messaging/distributedlog -r basic-stream- -e 3-7
    

You use a regex to programmatically create N streams, which would be N shards
in Kinesis or N partitions in Kafka.
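If that reading is right, the `{3-7}` part just enumerates numbered stream
names; a sketch of the expansion (an assumption based on the command above,
not taken from the dlog source):

```python
def expand_streams(prefix, range_expr):
    """Expand a prefix plus a numeric range like "3-7" into concrete
    stream names, mirroring what `dlog tool create -r -e` appears to do."""
    lo, hi = (int(x) for x in range_expr.split("-"))
    return ["%s%d" % (prefix, i) for i in range(lo, hi + 1)]
```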

------
merb
100% Java. I'm wondering: why doesn't this have any Scala?

~~~
niccaluim
Rumors of Scala's dominance at Twitter are slightly exaggerated. While it's
true that the non-revenue-related backend is almost all Scala, on the ads eng
side, it's almost 100% Java. Which language a new project is written in has
more to do with who's writing it than anything else.

~~~
merb
Actually, I just wondered because of Finagle. I thought that their projects
use the RPC service heavily.

~~~
niccaluim
Ah, Finagle has a Java API too.

------
swang
What? No bird-related name for this project?!

~~~
sulam
Bird projects got outlawed during the interregnum.

------
alphadevx
Wonder how it compares to Facebook Scribe?

~~~
manojlds
> This is an archived project and is no longer supported or updated by
> Facebook

~~~
alphadevx
Yes, but you can still use it, or fork it; it's actually a really cool
federated logging system.

~~~
ak4g
I hear, from people that would know, that it has a lot of bugs that were fixed
internally but never made it into an open-source release.

------
throwaway_xx9
No, actually Java is a bane to the database world.

Cassandra doesn't work, and Hadoop is a complete waste of hosts for most
companies (hence the move to Spark.)

~~~
nicobn
Blanket statements like "Cassandra doesn't work" and "Hadoop is a complete
waste of hosts for most companies" are unproductive and contribute nothing,
unless you can back them with data and real world examples.

So, what data do you base these assertions on? Also, not to burst your bubble,
but a lot of businesses (if not the majority) run Spark on YARN. And Spark is
built on the JVM.

~~~
pc86
If they had data and examples they would almost certainly have enough
experience not to say things like "______ doesn't work" and "______ is a
complete waste of hosts."

