
From Kafka to ZeroMQ for real-time log aggregation - janczukt
https://tomasz.janczuk.org/2015/09/from-kafka-to-zeromq-for-log-aggregation.html?term=12
======
thomaslee
I used to be on a team responsible for a single small-ish Kafka cluster
(between 6-12 nodes) doing non-trivial throughput on bare metal. Without
commenting on whether ZeroMQ is the right alternative: I can understand being
scared off. Our hand was forced such that we had to go the other way and
understand what was going on in Kafka.

The kicker is that Kafka can be rock solid in terms of handling massive
throughput and reliability when the wheels are well greased, but there are a
lot of largely undocumented lessons to learn along the way RE: configuration
and certain surprising behavior that can arise at scale (such as
[https://issues.apache.org/jira/browse/KAFKA-2063](https://issues.apache.org/jira/browse/KAFKA-2063),
which our team ran into maybe a year ago & is only being fixed now).

Symptoms of these issues can cause additional knock-on effects with respect to
things like leader election (we wound up with a "zombie leader" in our cluster
that caused all sorts of bizarre problems) and graceful shutdowns.

Add to that the fact that the software is still very much under active
development (sporadic partition replica drops after an upgrade from 0.8.1 to
0.8.2; we had to apply some small but crucial patches from Uber's fork) & that
it needs a certain level of operational maturity to monitor it all ... it's
easy to get nervous about what the next "surprise" will be.

Having said all that, I'd use Kafka again in a heartbeat for those high volume
use cases where reliability matters. Not sure I'd advise others without
similar operational experience to do the same for anything mission critical,
though -- unless you like stress. That stress is why Confluent is in business.
:)

~~~
BrandonBradley
I can attest to 'getting nervous about what the next surprise will be' with
Kafka. And I'm only dealing with a single node right now.

Kafka and Confluent Platform are very much still works in progress. I had to
patch Kafka Connect HDFS connector because a fix I needed was left out of the
last release. Be prepared to do something similar with any of Kafka's
components.

------
buster
To me it sounds like Kafka was not understood in full detail (maybe because of
missing documentation or the high complexity) and they switched to a system
they built themselves. Naturally they know in full detail what is going on and
can set up the system as needed.

I am wondering if working on solving the actual problems with Kafka would have
been the better route. I've never used Kafka and I find ZeroMQ great, but
reading that their logging solution drops log messages is a huge no-go for
operations. How can you claim to run a serious business and say "babies will
die" when you can't be sure you'll be able to find problems?

Because, when will you lose logs? Not in normal operation, but when weird
things happen. When networking has a hiccup. When load on the system is too
high, so most likely when many people are using your service. Exactly when
shit hits the fan. And you just made the decision that it's ok to drop log
messages in such cases? That's not good.

I think you should either dive into Kafka/Zookeeper and fix your problems or
switch to another logging solution. You should probably just drop that non-
sense "streaming and real-time logs" requirement and live with a log delay of
a few seconds and build something _really_ stable instead of building
something inherently unstable. Honestly, just collecting syslogs on the core
vm and sending them to a central server would have been the better solution.
Better then looking into fancy real-time, streaming logs on a sunday night
because the system is having a breakdown and you can't even be sure that you
are not missing essential logs.

~~~
AYBABTME
One has only two choices in those situations: drop logs or block receiving
more logs. Given their availability requirements, I don't think that blocking
is a viable choice. So dropping logs seems to be the only sane choice here.
There's really no other alternative, so I'm not sure about the consternation.

~~~
buster
What? No. You can just save logs on disk and buffer them. Just dropping logs
or blocking because some network resource is unavailable are both terrible
choices, and that's not how logging has worked for the last few decades.
Throwing away logs is a major step backwards.

~~~
AYBABTME
If you buffer to disk, the same problem will eventually show up. Queues (in
memory, on disk, anywhere) are all ultimately bounded, and when they are full,
you have 2 choices: block or drop. Somehow you need to make the choice,
there's no getting away from it.

If you don't make the choice consciously, say by assuming that you can buffer
to disk and avoid the problem, at some point you'll fill up your disks and
your system will block: you'll have unknowingly picked the "block" option. If
you decide to rotate logs and delete old rotations when too many logs are
present, then you're picking the "drop" option...
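
The two options show up directly in Python's standard `queue` module (a
sketch, not tied to any particular logging stack): a bounded queue forces you
to either wait on `put` or discard via `put_nowait`.

```python
import queue

buf = queue.Queue(maxsize=2)  # every real buffer is bounded somewhere

# "block" option: put() waits until a consumer frees a slot
buf.put("log line 1")
buf.put("log line 2")

# "drop" option: put_nowait() raises queue.Full instead of waiting
dropped = 0
try:
    buf.put_nowait("log line 3")
except queue.Full:
    dropped += 1  # consciously discard the message and count it

print(dropped)  # -> 1
```

Counting drops (instead of silently swallowing them) at least makes the
chosen trade-off observable.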

~~~
buster
That's why you aggregate logs in a central service. I was writing about
sending logs to a central service and not about how your disks fill up with
more logs. There is log rotation for that and usually your logs will have been
sent way before any log rotates. If your log rotation deletes logs before you
aggregated them or if you let your disks fill up with logs you have a much
bigger problem you should fix, of course.

~~~
AYBABTME
That's again the same problem: if your centralized service isn't reachable for
whatever reason, your nodes can buffer for a while (in memory or on disk), but
eventually the problem will always boil down to 'drop' or 'block'. However you
construct it, somewhere you need to make that call. They made the call to drop
logs, and that's totally fine.

~~~
buster
If your central infrastructure is down for many days you have other problems.
Buffer logs on disk: rotated and zipped files don't take much space, so you
can buffer a lot before you run into trouble. That's my point. You make it
sound like buffering on the node is only possible for seconds, whereas in
real-world scenarios log files are written for days, weeks or even months,
even in very large deployments with lots of logs. They made the decision not
to use this advantage and to throw logs away for no good reason.

------
agentgt
I don't understand why people need such ridiculously fast systems when we are
using RabbitMQ and crappy Apache Flume and we generate more than 5k
messages/second, with spikes of 50k. Please, author of the article, tell me
your metrics.

And our log messages are ridiculously big at times (15k to as big as 50k).

Our pipe never has problems. What fails for us is Elasticsearch. In fact, at
one point in the past we did 100k messages/s when we embarrassingly had debug
logging turned on in production; RabbitMQ did not fail, but Elasticsearch did,
and sadly Flume did as well (I tried to get rid of Flume with a custom Rust
AMQP-to-Elasticsearch client, but at the time I had some bugs with the
libraries... maybe I will check out Mozilla Heka again someday).

There is this sort of beating of the developer chest with a lot of tech
companies... hey, listen, we are ultra important, we are dealing with
ridiculous traffic and we need ultra high performance. Please tell/show me
these numbers... Or maybe stop logging crap you don't need to log.

Or maybe I'm wrong and we should log absolutely everything, and Auth0 made the
right choice given their needs (let's assume they have millions of messages a
second). I still think I could make a sharded RabbitMQ go pretty far.

This goes for other technology as well. You don't need to pick hot glamorous
NoSQL when PostgreSQL or MySQL and a tiny bit of engineering will get the job
done just fine, particularly when mature solutions give you so many things
free out of the box (RabbitMQ gives you a ton of stuff, like a cool admin UI
and routing, that you would have to build yourself with ZeroMQ).

~~~
packetized
We run an average of 14k logs/sec through a two-node RMQ cluster, with max
sustained throughput in the ~50k range. You're spot on with the bottleneck
being Elasticsearch, but the latest releases in the 2.x train have a lot of
fine adjustments that have drastically improved our indexing rate, such that
we actually index at a 50k/sec rate. Would be interested to hear about your ES
cluster configuration.

~~~
agentgt
I'm embarrassed to say that at the present moment we don't use ES clustering
but rather one monstrously powerful bare metal machine, as we had issues with
the cluster failing due to some network issues we had with Rackspace.

BTW, I didn't mean to denigrate Elasticsearch (I assume that is why I'm
getting downvoted... a comment would help). We just haven't had the chance to
upgrade it and properly configure it.

In fact, Elasticsearch has been pretty darn speedy as of late, particularly
since we purge some of the data after 6 months (we still have permanent
filesystem storage of logs, of course).

~~~
gerakinis
You can turn off multicast discovery and list unicast peer addresses instead.
If you are in the cloud and you are clustering, this is step 1 =)
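
For the Elasticsearch versions of that era (1.x/2.x zen discovery), that is a
couple of lines in `elasticsearch.yml`; the host list below is of course a
placeholder:

```yaml
# elasticsearch.yml: disable multicast, enumerate peers explicitly
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
```

Most cloud networks (AWS included) don't route multicast, which is why this is
step 1 there.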

------
wcdolphin
Did you ever try running 5 ZK's in the ensemble? 3 is the absolute minimum to
survive a single machine failure. If you are having trouble with availability,
it seems natural to increase your safety factor there.
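
The quorum arithmetic behind that advice (not specific to ZooKeeper; it holds
for any majority-quorum system): an ensemble of n nodes tolerates
floor((n-1)/2) failures, so going from 3 to 5 doubles your failure budget.

```python
def tolerated_failures(n: int) -> int:
    """A majority quorum needs floor(n/2) + 1 live nodes,
    so it survives the loss of the remaining floor((n - 1)/2)."""
    return (n - 1) // 2

for n in (1, 3, 5, 7):
    print(f"{n} nodes -> survives {tolerated_failures(n)} failure(s)")
```

Note that even-sized ensembles buy nothing: 4 nodes tolerate the same single
failure as 3, while adding one more machine that can fail.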

I was surprised by the contrasting sense of importance of delivery guarantees
in the article. At the start, losing a message was akin to the death of a
child. At the end, _shrug_. Now every single machine failure (or even a ømq
process restart) will lose you the log messages stored in memory :(.

Glad to hear you found a solution that worked for you though! Would love to
hear about difficulties you had with the new system, in particular adding
brokers.

~~~
_qc3o
They said availability was the "death of a child", not dropping log messages.
The trade-off they've made here, staying available at the cost of some
potential loss of visibility, is the right one. The system overall is clearly
simpler, and simpler systems have simpler failure modes, so it is easier to
add mitigation components on top that recover from those failure modes and
guarantee higher uptime.

I've never heard anyone say managing a production Kafka cluster was easy or
simple. Well, anyone who has had to actually maintain such clusters hasn't
said it anyway.

~~~
fauigerzigerk
_> They said availability was "death of a child", not dropping log messages._

True, but it appears to me that availability problems and dropped log messages
often have the same root cause - network issues.

So whenever they do have availability issues (and dying babies) they won't be
able to investigate properly because log messages are being lost as well.

That's obviously a very general observation. It may well be that in their
architecture availability issues are mostly caused by something unrelated to
networking (e.g. the database).

~~~
dkarapetyan
It would be quite simple to have a two tiered approach to the logging problem
since they have separated it into 2 components. One can just write and ship
files while the other is what they have described in terms of providing real
time streaming.

So the question then becomes: what are the failure modes of their logging
setup in terms of misbehaving clients? I don't know how Kafka handles
misbehaving clients. I suspect it would lead to global effects and a slowdown
of the entire cluster because of 1 or 2 misbehaving clients, whereas in the
current setup misbehavior stays localized to the nearest aggregator dropping
messages. Simple memory usage and other kinds of monitoring can then be used
to find these issues and mitigate them accordingly.

This is still a heck of a lot simpler setup than using Kafka and worrying
about all sorts of weird distributed-system failure modes. I'm sure Kafka got
them started initially, but continuing to use it is like using a sledgehammer
to kill a fly. For the use case they have, this setup is the correct one, and
migrating to Kafka if it becomes necessary will be possible. So in my view
this is proper engineering. They've made all the right trade-offs instead of
just chasing fads and trends.

~~~
fauigerzigerk
_> It would be quite simple to have a two tiered approach to the logging
problem since they have separated it into 2 components. One can just write and
ship files while the other is what they have described in terms of providing
real time streaming._

Yes, they absolutely could do that, but they apparently don't. And maybe
that's because they would lose a lot of the simplicity they won by ditching
Kafka.

Anyway, I didn't want to defend Kafka specifically. The one time I considered
it, I ended up not using it because it seemed too heavy weight for my use case
in terms of memory usage and complexity.

------
TheHydroImpulse
FYI, Kafka doesn't need to fetch from disk every time as it caches the logs
pretty aggressively, as long as you have enough memory.

Running Zk and Kafka on the same nodes is likely not the best thing.

~~~
im_down_w_otp
Why? I would think that, as long as there wasn't massive I/O contention
between the two, co-locating Kafka and Zookeeper on the same machines would
mitigate a whole massive class of weird edge cases by removing one of the
failure modes: the network boundary between the two critical components.

Though for my part I still don't understand why Zookeeper wasn't built as a
library to add distributed strongly consistent coordination to software that
needs/benefits from it rather than being an external service that needs to be
connected to, and thus introduces a gnarly mess of new failure modes that make
Zookeeper client behavior extremely critical and often fragile. Something
that's more like a "libpaxos/libraft" (e.g. serf for Go-lang or riak_ensemble
for Erlang) seems a lot more valuable. /shrug

~~~
TheHydroImpulse
But co-locating them won't actually remove a class of errors because Zk is not
HA. The Kafka brokers need to communicate with the leader in the Zk cluster.

If we have K1,Z1 -- K2,Z2 -- K3,Z3 -- and one node goes down, you've now taken
down both a broker and a Zk node. Remember, the brokers don't care about
connecting to _any_ Zk node, they want the leader. So you aren't gaining any
fault tolerance by co-locating them.

If there's a network partition between the leader Zk node and other nodes, the
local Kafka broker won't actually be able to do much because the Zk cluster
will elect a new leader, on another node, so again, you aren't gaining
anything.

Moreover, you're now tying the scalability of Kafka to Zk. Zk doesn't scale
linearly, so there are only so many nodes you can have in a cluster. Kafka, on
the other hand, scales linearly. So if you're co-locating them and you have to
bump up Kafka, do you still start up Zk on those nodes (but they don't
actually join the cluster)? You're now special-casing and adding more edge
cases.

------
htn
FWIW, you can get Kafka packaged as a fully managed and HA service from
[https://aiven.io](https://aiven.io) on AWS and also Azure, GCE and
DigitalOcean.

But if Auth0 runs their entire operations on AWS, maybe Kinesis would have
been a more natural transition.

~~~
janczukt
We need an on-premises and cloud story, so cloud-only solutions did not cut it
for us.

~~~
PieterH
The article is a little old. How has the system run since you deployed it? Do
you have any interesting figures?

~~~
janczukt
It continues to run beautifully. Since we rolled it out back in 2015 we have
had zero issues with real-time logging. I have particularly fond memories of
the first week after the rollout; it felt like vacation. I finally could get
some sleep.

------
StreamBright
The author correctly points out that he is comparing apples to oranges.

Kafka gives you features that certain systems cannot live without, like
on-disk persistence (it saved my life a couple of times) and topics. Filtering
messages on the client side, the way ZeroMQ does it, is not an option in many
cases; just think about security. I think Kafka has a long way to go before it
can be used as a general message queue (many features are not there yet, like
a visibility timeout, for example), but if you can manage ZooKeeper and have
the means to work with it (somebody understands it and knows its quirks), it
can provide a reliable platform for distributing a large number of messages
with low latency and high throughput, just like it does at LinkedIn.

------
bachback
With ZeroMQ I had the worst possible results and experience. Honestly much of
what it claims is bogus. It is highly optimized for certain cases and utterly
useless for distributed systems. Try and find out in PUB/SUB what the IP
addresses of the subscribers are. Not possible. In many cases you will be much
better off learning TCP/IP yourself. In the mentioned case you simply iterate
over the vector of subscribers - much more powerful and the sane default. It
seems at some point people confused internal networking solutions with the
Internet.

~~~
vegabook
It is trivially easy for any node to broadcast its IP address to the whole
network periodically (in my case every 2 seconds) using a separate thread and
UDP. Using this technique I have a rock-solid ZeroMQ topology that reconnects
with a max downtime of about 2.5 seconds (because I broadcast every 2 seconds)
for any single node failure. I agree that this functionality could be better
implemented in zmq, but using this simple technique the rest of zmq becomes
amazing. In Python:

    
    
      import socket
      import time
      cs = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      cs.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
      cs.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)  # needed for 255.255.255.255
      while True:
          cs.sendto(b'Node ID', ('255.255.255.255', 54545))
          time.sleep(2)  # matches the 2-second heartbeat interval above
    

Everybody listening on port 54545, without knowing Node ID's IP address, will
get these messages, which include the broadcaster's IP address.

    
    
      import socket
      s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      s.bind(('', 54545))            # listen on all interfaces
      data, addr = s.recvfrom(1024)  # addr[0] is the broadcaster's IP
      print(data.decode(), addr[0])
    

This is a very useful technique when using ZeroMQ generally, as you can
broadcast services without knowing any IP address, so they can come up and
down on new addresses if/when necessary.

~~~
mej10
This would be great, but I don't think it works on AWS -- I don't think they
support broadcast.

~~~
kchoudhu
My ZeroMQ components all register themselves in a database when coming and
going. This makes it trivially easy to find where stuff is just by running a
bog simple database query.

Lots of ways to skin this cat...

~~~
oldmanjay
What if they crash without registering an exit?

~~~
vegabook
They broadcast every 2 seconds. No heartbeat = dead.

~~~
oldmanjay
So there's more machinery waiting in the wings, not just the bog-simple query.
Presumably you've barely described all of the mechanisms, and I won't bother
socratically making the point that handwaving complexity away isn't the same
thing as simplifying. I'll simply state it.

~~~
vegabook
The above code is wrapped in a thread and runs non-stop in each node. It's
really not very complicated: it's basically sticking a def around the top
block of code and then thread.start() in the main. It's extremely cheap
because it sends and then sleeps for a few seconds. Any node can then listen
to the broadcasts on a known port (but no IP address needed) and will know
exactly what the state of the system is and all the IPs. Then in the main loop
you just make sure that you've gotten a recent pulse, otherwise you reconnect.
If you're using zmq it's just another short poller entry amongst the others
that you will already inevitably have. Literally a couple of lines. It allows
you to bring nodes up and down at will for robustness and scalability.

I think it's much less complex than running an entire database node just for
this, which btw will also require you to poll constantly, and will require you
to bring an (often heavyweight) client library into each node too, as opposed
to standard-library sockets, which, if you're running multinode, you're almost
certainly already importing. If you're looking for "simple" distributed
computing, my sense is that it has yet to be invented.
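
A minimal sketch of the "recent pulse" bookkeeping described above (the class
name, the broadcaster thread, and the 5-second timeout are illustrative
choices of mine, not the commenter's actual code):

```python
import threading
import time

class PeerTable:
    """Tracks the last heartbeat time per node; no recent pulse = presumed dead."""
    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_seen = {}
        self.lock = threading.Lock()

    def beat(self, node_id, now=None):
        # Called whenever a UDP heartbeat from node_id arrives.
        with self.lock:
            self.last_seen[node_id] = time.monotonic() if now is None else now

    def alive(self, now=None):
        # Nodes whose last pulse is fresher than the timeout.
        now = time.monotonic() if now is None else now
        with self.lock:
            return {n for n, t in self.last_seen.items()
                    if now - t < self.timeout}

# In a real deployment a daemon thread would feed this table from the
# UDP broadcasts shown earlier; here we simulate two heartbeats.
peers = PeerTable(timeout=5.0)
peers.beat("node-a", now=100.0)
peers.beat("node-b", now=102.0)
print(peers.alive(now=106.0))  # node-a's pulse is 6s old -> {'node-b'}
```

The main loop then reconnects to any socket whose peer has fallen out of
`alive()`, which is the "recent pulse" check in a couple of lines.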

------
_halgari
ZMQ's default behavior (and in some cases only behavior) of dropping new
messages when buffers are full made it a no-go for my client. We ended up
switching away from ZMQ to a more traditional durable queue, saving a ton of
code complexity and gaining a lot of reliability in the process. Having now
researched it, I can't think of a reason I'd ever use ZMQ again. I'll either
use a durable queue when I care about message delivery, or something much more
traditional when I don't.

------
markpapadakis
Maybe TANK ([https://github.com/phaistos-
networks/TANK](https://github.com/phaistos-networks/TANK)) would have been a
good alternative here. No feature parity with Kafka, but setting it up is a
matter of running one binary and creating a few topics, and it is faster than
Kafka for produce/consume operations. (Disclosure: I am involved in its
development.)

------
siscia
Did you consider MQTT? Sounds to me like a more natural choice.

------
jpgvm
Probably should have been running ZK and the Kafka queues separately from the
CoreOS/container shenanigans.

If deployed using the Netflix co-processes, both are very durable.

------
Nimimi
You can deploy Kafka using DC/IO and it takes care of HA for you. DC/IO is
quickly becoming the go-to solution for database deployments. ArangoDB even
recommends it as the default.

Now about Kafka vs ZeroMQ: you want Kafka if you cannot tolerate the loss of
even a single message. The append-only log with committed reader positions is
a perfect fit for that.
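
A toy model of that design (in-memory, single partition; the names are mine,
not Kafka's API): messages are appended to a log, and each consumer only
advances its committed position after processing, so a crash before the commit
means re-reading, never losing.

```python
class AppendOnlyLog:
    def __init__(self):
        self.entries = []    # the log itself: append-only, never mutated
        self.committed = {}  # consumer id -> next offset to read

    def append(self, msg):
        self.entries.append(msg)

    def poll(self, consumer):
        """Return unread entries starting at the last committed offset."""
        start = self.committed.get(consumer, 0)
        return self.entries[start:]

    def commit(self, consumer, offset):
        """Record progress only after the consumer has processed up to offset."""
        self.committed[consumer] = offset

log = AppendOnlyLog()
log.append("evt-1")
log.append("evt-2")

batch = log.poll("c1")  # ['evt-1', 'evt-2']
# ... consumer crashes before committing: nothing is lost ...
assert log.poll("c1") == ["evt-1", "evt-2"]  # same batch redelivered

log.commit("c1", 2)     # processed both, now advance
assert log.poll("c1") == []
```

This gives at-least-once delivery: the failure mode is a duplicate read, never
a silently dropped message, which is exactly the opposite trade-off from a
drop-on-full buffer.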

~~~
oblio
Do you mean this? [https://dcos.io/get-started/](https://dcos.io/get-started/)
aka DC/OS?

From what I can see it doesn't really support database deployments except for
ArangoDB and Cassandra.

~~~
ryanmaclean
Riak and MySQL are in Universe, for example:
[https://github.com/mesosphere/universe/tree/version-3.x/repo...](https://github.com/mesosphere/universe/tree/version-3.x/repo/packages/M/mysql/0)

------
k__
I'm a total message queue noob. What are the usecases for them?

I used MQTT but only as a message bus.

~~~
zo1
From my point of view, the main things behind message queues (not ZMQ
specifically) are guaranteed delivery, persistence, multiple-message
atomicity, message passing/forwarding, and sometimes guaranteed message
ordering. Other than that, all they do is facilitate communication between
different actors.

Nothing magical/weird about it; it just depends on whether or not you've got a
nail to hammer with your MQ-hammer.

------
weitzj
Did you look at nsq.io or NATS?

~~~
tjholowaychuk
+1 for NSQ, it's not a magic bullet in terms of scalability but you can get
quite far. When I was at Segment we were pushing an easy 2-3B messages per day
through it, if not more with message "amplification" internally.

------
manigandham
Why don't all these companies ever just use real enterprise software?

There are about a dozen message systems out there that will handle much more
than Kafka with minimal or no operational overhead while supporting everything
they need.

------
wanderr
I came up with a very different solution for real time access to logs: tail
them to slack. It's not an aggregation solution and doesn't work well if you
have chatty logs with nothing to filter on, but if you just want to be
notified when things are happening in the logs it's pretty nice and doesn't
need any infrastructure.

[http://wanderr.com/jay/tail-error-logs-to-slack-for-fun-and-profit/2016/05/12/](http://wanderr.com/jay/tail-error-logs-to-slack-for-fun-and-profit/2016/05/12/)

~~~
wanderr
Why the downvote? The article says "Real-time access to server-side logs is
what makes backend development palatable in the era of cloud computing. As a
developer you want to be able to get real-time feedback from your server side
code deployed to the actual execution environment in the cloud, especially
during active development or staging." and this is another solution that
provides that.

------
jvoorhis
2015

------
efangs
Anyone use collectd + RRD for this purpose? Still trying to understand at what
scale it's worth moving to something else.

------
asasidh
So you used Kafka for something that should have been handled by MQTT or
ZeroMQ in the first place?

~~~
cbsmith
MQTT is just a protocol, so not sure how that helps.

0MQ doesn't sound like it is the right solution either, but yeah... often you
pick the wrong tool and learn something in the process.

------
bdowling
But why ZeroMQ and not nanomsg?

~~~
PieterH
See [http://hintjens.com/blog:112](http://hintjens.com/blog:112) for my
opinion on why nano isn't (wasn't, perhaps, as it seems to be doing better) a
good choice.

~~~
kal31dic
With sincerely the greatest respect and admiration for what you achieved with
ZeroMQ, Pieter, I think perhaps one might be a bit more nuanced when one isn't
a neutral party. Full disclosure: I wrote a D wrapper, but I am not involved
in nanomsg development and am just a user.

There was some drama when the maintainer quit briefly before rejoining. Since
then the gitter channel has been more active than I remember it being before.
The mailing list is quiet, it is true. Somebody just released a Rust version,
and version 1.0.0 of nanomsg was indeed released.

You can see commit history here:
[https://github.com/nanomsg/nanomsg/commits/master/src](https://github.com/nanomsg/nanomsg/commits/master/src)

~~~
PieterH
Thanks for the updates. I've edited my comment. I've always wanted nano to
succeed, just disliked the negative attitude to ZeroMQ expressed in its docs,
which seemed unnecessary and damaging.

~~~
kaleidic
Yes, well, the little guy always wants to unseat the dominant player, and when
there is some personal history involved, things get more mixed up (technical,
emotional, manner of expression). In your shoes I would be irritated by that
too. But a project, if it develops, eventually transcends the personalities
involved, and that seems to be happening now.

~~~
PieterH
ZeroMQ is hardly dominant. It's a small player in a huge market and there was
and is space for many more projects in this area. I'm glad nano seems healthy
again, yet it's not enough, and I'll explain why.

In the end the whole point of ZeroMQ was to build new protocols and APIs for
decentralized messaging. My real disappointment with nano was that it made
zero effort to build on existing work (mainly, ZMTP) and instead just started
again, as if thousands of people hadn't spent years figuring out what a
decentralized messaging protocol might look like.

It was worse than that, in fact. Nano launched itself on a wave of negativity.
It makes good press, and poor everything else. Such hate for one's own history
and knowledge base isn't healthy. If a messaging product isn't aiming at
_interoperability_ as a primary goal, it is worthless.

I'm not a fan of the "IETF or bust" approach either. That just doesn't work if
you're a decentralized community. We needed and still need lightweight
processes for RFC development. We use such a process (Digistan's COSS) in our
RFCs. It wasn't random. I built Digistan and COSS over years after seeing AMQP
swallowed up and destroyed by a committee. Why isn't nano using a process like
COSS?

Without interoperable protocols, all we have is a bunch of software projects.
And they die. And then all this is for nothing and the proprietary systems
will rule the world and our dreams of making distributed software cheap again
will die as well.

And this makes me angry: nano had the chance to push this forwards, and threw
it away like old trash. What a stupid, petty waste of opportunity and
goodwill.

