
LogZoom: A fast, lightweight substitute for Logstash/Fluentd in Go - whost49
https://packetzoom.com/blog/logzoom-a-fast-and-lightweight-substitute-for-logstash.html
======
latch
Operationally speaking, the single most important thing you should be doing is
collecting application and system logs and having them easily accessible and
usable (and check your backups every now and again). I say this because of the
value you gain relative to the small cost. You're being your own worst enemy
if you aren't staying on top of error logs.

The OSS solutions are mature and simple to set up. And it isn't something you
need to get absolutely correct with 100% uptime. If you're an "average"
company, a single server running Logstash+ES+Kibana is probably good enough.
There are only two ways you can do this wrong: not doing it at all, or
forwarding non-actionable items (which creates a low signal-to-noise ratio,
and people will just ignore it).

After that comes metrics (system and application), which is important, but not
as trivial to set up.

Quickly looking at LogZoom, I think the more forwarding options we have, the
better. They make it very clear that, unlike Logstash, this doesn't help
structure data. On one hand, I don't think that's a big deal. Again, if you're
only writing out actionable items, and if you're staying on top of it, almost
anything that moves data from your app/servers onto ES+Kibana (or whatever) is
going to be good enough.

On the flip side, adding structure to the logs can help as you grow.
Grouping/filtering by servers, locations, types (app vs system), versions...is
pretty important. I like Logstash; I actually think it's fun to configure (the
grok patterns), and it helps you reason about what you're logging and what
you're hoping to get from those logs.
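
For example, a first pass at structuring app logs can be a single grok
filter. A minimal sketch (the log format and field names are invented for
illustration, not from the article):

    filter {
      grok {
        # "2016-03-18T10:12:01Z ERROR connection refused" -> ts, level, msg
        match => { "message" => "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
      }
    }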

~~~
chetanahuja
PacketZoom founder here. Glad you liked the project. Could not agree more
about the importance of tracking logs (and metrics... but that's a topic for
another post).

To respond to your point about the absence of a Grok-like facility: avoiding
the need to unmarshal and remarshal the data while passing through LogZoom was
an explicit design requirement. The blog post refers to our pain with
Logstash/Fluentd etc. We were in a situation where our production code was
fighting for resources against a log collection facility.

In general, it's best to process the data (to the extent possible) closest to
its point of origin. It's orders of magnitude cheaper to create a well-
structured log line straight from your production code (where it's just some
in-memory manipulation of freshly created strings) than in a post-
processing step inside a separate process (or machine).
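
To make that concrete, here is a minimal Go sketch of the idea: the
application emits a finished JSON log line itself, so nothing downstream ever
has to re-parse it (the schema is illustrative, not from the post):

    package main

    import (
        "encoding/json"
        "os"
        "time"
    )

    // logLine is an invented schema; use whatever fields you need.
    type logLine struct {
        TS    string `json:"ts"`
        Level string `json:"level"`
        Event string `json:"event"`
    }

    func main() {
        // Cheap in-memory manipulation of freshly created strings at the
        // point of origin -- no grok step needed later in the pipeline.
        enc := json.NewEncoder(os.Stdout)
        enc.Encode(logLine{
            TS:    time.Now().UTC().Format(time.RFC3339),
            Level: "error",
            Event: "upstream_timeout",
        })
    }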

I've spent years dealing with performance problems in global-scale production
stacks, and a surprisingly high number of resource bottlenecks (memory, CPU,
disk IO) are caused by ignoring this simple principle.

I've lost count of the cases where a simple restructuring of the architecture
to avoid a marshal/unmarshal step drastically cuts down resource requirements
and operational headaches. Unfortunately, a whole lot of industry "best
practices" (exemplified by the Grok step in Logstash) encourage the opposite
behavior.

~~~
jsmeaton
I think you make a good point that logs should be transformed closer to the
source. I work primarily with applications provided by a vendor, with very
unstructured log data. Transforming (Grok) these logs is an absolute must; we
couldn't look at something that didn't allow transformation. That said, maybe
we should be looking at something closer to the source before handing it off
to a central location. Are you aware of agent-like daemons that do
transformation before handoff?

~~~
seanp2k2
Structured logs are awesome and a great idea. For the next few decades, while
standards come and go and everyone gets it all implemented across the board,
yes, it sucks to write grok patterns for the flavor of the week. But once
you've done it a few times, it takes maybe a few hours of work to get some app
cluster with moderate logging flowing into ES with all the right types and all
the edge cases accounted for. From there, ELK is such a Swiss Army knife that
it's worth the trouble: it's then trivial to, e.g., fire PagerDuty alerts when
you hit exception-level log lines, or post metrics about your logs, or put
them on a queue to flow into some big data pipeline.

------
azylman
I wonder if they considered Heka
([https://hekad.readthedocs.org/en/v0.10.0/](https://hekad.readthedocs.org/en/v0.10.0/)),
made by Mozilla? It's written in Go and, as far as I can tell, solves many of
the same problems and more.

~~~
robbles
I looked into Heka a while back and was turned off by its seeming reliance on
Lua scripts for almost every feature. It just seemed overcomplicated to deploy
and maintain for a Go app as a result, and I didn't understand why all that
functionality didn't come built in.

Is this unfair? Perhaps I misunderstood the docs.

~~~
azylman
If you want to write your own plugins, you can do them in Lua or Go. Lua is
the recommended way for most plugin types - I think matching is supposed to be
faster, and you also don't need to recompile the Heka binary.

If you don't want to write your own plugins, you'll never interact with Lua.

~~~
robbles
Don't you need to ship all those built-in Lua plugins around with your
deployment of Heka, though? That's what turned me off - most of the other
options can be used with a binary + config file.

------
makapuf
What about plain rsyslog? I keep stumbling on this kind of program (others
have mentioned Heka, Fluentd, Logstash), but the general speed, simplicity,
versatility - the feature range is actually quite big, from ES output to Unix
pipes to simple filters - and ubiquity of rsyslog make it suited for many of
these tasks. Am I missing something?

~~~
lobster_johnson
We would switch away from Rsyslog in a heartbeat if someone could come up with
a better syslog-compatible forwarder.

We have it set up to write logs locally (with a limited rotation) as well as
forward them via TLS to a central Rsyslog server that collects the logs in a
single tree with a much longer retention time. (We don't use any of the non-
file outputs, but we do sync to S3 for archival.)
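
For reference, the shape of that setup in rsyslog's newer syntax is roughly
this (hosts, paths, and queue settings are illustrative, and exact parameter
support varies by version):

    # Local copy of everything (rotation handled separately)
    *.* /var/log/all.log

    # TLS-forward to the central collector, with a disk-assisted queue
    action(type="omfwd" target="logs.example.com" port="6514" protocol="tcp"
           StreamDriver="gtls" StreamDriverMode="1"
           StreamDriverAuthMode="x509/name"
           queue.type="LinkedList" queue.filename="fwd_spool"
           queue.saveOnShutdown="on")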

It has major issues. For one, its spooling implementation is flaky. /dev/log
is a limited, synchronous-blocking FIFO buffer, which means that everything
that logs (including OpenSSH!) will choke if the buffer is full. For some
reason, just a tiny bit of packet loss will throw Rsyslog off.

It is also frequently unable to recover from a network blip, and a restart is
the only solution. But its spool file is badly implemented, so on restart it
will typically ignore the old spool files and start anew, meaning you lose
data. Someone wrote a Perl script to fix a broken spool directory, but I never
got it to work.

Ironically, Rsyslog is also terrible at logging what it's unhappy about at any
given time, so whenever something bad happens, you probably won't get anything
in the system log.

Rsyslog's configuration is a curious beast, and by curious I mean infuriating.
Rsyslog originally had an antiquated, ad-hoc, messy line-oriented
configuration file format (with directives like "$RepeatedMsgReduction off"),
and the author decided to transition to a more modern, block/brace-based
syntax. Unfortunately, he decided to do this gradually, and both syntaxes can
co-exist in the same file. For a while, many of the options were only
available in the old syntax, so you had to mix the two.
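
To illustrate, a single working config can end up mixing both styles, like
this (contents illustrative):

    # Legacy line-oriented directives
    $RepeatedMsgReduction off
    $ActionQueueType LinkedList

    # Newer block/brace ("RainerScript") style in the same file
    if $programname == 'myapp' then {
        action(type="omfile" file="/var/log/myapp.log")
    }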

Which leads me to the next problem: The documentation is absolutely atrocious.
The Rsyslog site is a fragmented mess of mostly outdated information. It's
gotten better with v8, but it's still the worst OSS project documentation I've
encountered. There's no reference section that lists the possible config
options. Frequently there is _no_ documentation for a particular setting.
Rsyslog is quite finicky about some combinations of options (like TLS driver
configs) and you have to proceed by trial and error. Frustratingly, it will
silently ignore some config errors (such as trying to set up multiple TCP
listeners, which is still not supported).

The new config format is better, but it still has the feel of something that
has been implemented before it was fully designed.

Again, we don't use any of the fancy output modules. Maybe they are solid, but
based on my experiences with the simple file-based stuff, I wouldn't bet on
it.

As an aside, it's worth pointing out that Rsyslog is still using the Syslog
protocol, which has all sorts of issues (not consistently implemented by
clients or servers; does not support multi-line messages). Rsyslog has another
protocol, RELP, that I believe you can use for forwarding, but I don't think
it's been implemented outside of Rsyslog.

As far as I can tell, there aren't any good alternatives. syslog-ng's
forwarding support is commercial and quite expensive. Logstash might work, but
I don't want to run a memory-hungry Java app on each box.

~~~
Karunamon
Not sure what you mean by forwarding support - my entire environment is
configured with syslog-ng forwarding to various places based on various rules.

Heck of a lot faster than rsyslog. The one thing I've not been able to do is
get rsyslog forwarding to syslog-ng. Something happens to the message format
between systems that leads to _hilariously_ incorrect filenames on the
collector systems.

~~~
lobster_johnson
By forwarding I mean reliable, disk-buffered forwarding. This only exists in
the commercial "Premium Edition" of syslog-ng.

~~~
b0ti
There is disk-based buffering in NXLog CE. You might want to check it out with
respect to the other woes you have with rsyslog.

~~~
lobster_johnson
Never heard of that one, thanks.

------
andygrunwald
We at trivago had a similar problem. For this we created Gollum:

- [http://tech.trivago.com/2015/06/22/gollum/](http://tech.trivago.com/2015/06/22/gollum/)
- [https://github.com/trivago/gollum](https://github.com/trivago/gollum)

We use it heavily to stream any kind of data into Kafka: access and error
logs, application logs, etc. Did you consider it as well?

~~~
whost49
No, we did not consider Gollum--it definitely looks like a possible solution,
and one we might have picked had we found it. I think the name of the project
makes it hard to find, unfortunately.

------
andrewvc
Hi all, Logstash developer here. It's always exciting to see new stuff in this
space; however, this post has me confused. Maybe the OP can clue me in.

The assertion "This worked for a while, but when we wanted to make our
pipeline more fault-tolerant, Logstash required us to run multiple processes."
is no more true for Logstash than it is for any other piece of software.
Single processes can fail, so it can be nice to run multiples. It would be
great if the author of the piece had clarified that further. If you're around,
I'd love to hear specifically what you mean by this. Internally Logstash is
very thread-friendly; we only recommend multiple processes when you want
either greater isolation or greater fault tolerance.

I don't personally see what the difference is between:

Filebeat -> LogZoom -> Redis -> Logstash -> (Backends)

and

Filebeat -> LogStash -> Redis -> Logstash -> (Backends)

or even better

Filebeat -> Redis -> Logstash -> (Backends)

You can read more about the filebeat Redis output here:
[https://www.elastic.co/guide/en/beats/filebeat/current/redis-output.html](https://www.elastic.co/guide/en/beats/filebeat/current/redis-output.html)
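
The relevant bit of filebeat.yml is small; a sketch in the newer
configuration syntax (option names have shifted between Beats versions, so
treat this as indicative):

    output.redis:
      hosts: ["redis.example.com:6379"]
      key: "filebeat"       # Redis list to push events onto
      password: "secret"    # password auth; Redis has no built-in TLS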

~~~
whost49
> If you're around I'd love to hear specifically what you mean by this.
> Internally Logstash is very thread friendly, we only recommend multiple
> processes when you want either greater isolation or greater fault tolerance.

Right, we considered using multiple Logstash processes, but we really didn't
want to run three instances of Logstash, each requiring a relatively
heavyweight Java VM. The total memory consumption of a single VM running
Logstash is higher than that of three different instances of LogZoom.

We looked at the Filebeat Redis output as well. First, it didn't seem to
support encryption or client authentication out of the box. But what we really
wanted was a way to make Logstash duplicate the data into two independent
queues so that Elasticsearch and S3 outputs could work independently.
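
Roughly the fan-out we were after, sketched in Go (toy code, not LogZoom's
actual implementation; in practice the two queues are Redis lists, so the
outputs can fail and retry independently):

    package main

    import "fmt"

    // duplicate copies every incoming line onto each output queue, so the
    // consumers (e.g. an ES writer and an S3 writer) drain independently.
    func duplicate(in <-chan string, outs ...chan<- string) {
        for line := range in {
            for _, out := range outs {
                out <- line // each queue gets its own copy
            }
        }
        for _, out := range outs {
            close(out)
        }
    }

    func main() {
        in := make(chan string, 1)
        esQueue := make(chan string, 1024)
        s3Queue := make(chan string, 1024)
        go duplicate(in, esQueue, s3Queue)
        in <- `{"level":"error","event":"example"}`
        close(in)
        fmt.Println(<-esQueue) // the same line is also on s3Queue
    }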

~~~
andrewvc
Thanks for the thoughtfully considered response :).

Regarding security with Redis: did you read the docs here?
[https://www.elastic.co/guide/en/logstash/current/plugins-outputs-redis.html](https://www.elastic.co/guide/en/logstash/current/plugins-outputs-redis.html)
Logstash does support Redis password auth (as does Filebeat). Regarding the
encryption point, seeing as Redis doesn't support SSL itself, are you using
spiped as the official Redis docs recommend?

Regarding the two queues, I would like to clarify that you can do this with
the

Filebeat -> Logstash -> Redis -> Logstash -> (outputs)

technique. If you declare two Redis outputs in the first 'shipper' Logstash,
you can write to two separate queues and have the second 'indexer' read from
both.
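
Concretely, the shipper's output section would look something like this (host
and key names invented for illustration):

    output {
      # Every event is written to both lists, so the ES and S3 pipelines
      # can consume them independently.
      redis { host => "queue-host" data_type => "list" key => "logs_es" }
      redis { host => "queue-host" data_type => "list" key => "logs_s3" }
    }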

It is true that if one output is down we will pause processing, but you can
use multiple processes for that. It is possible that in the near future we
will support multiple pipelines in a single process (which we already do
internally in our master branch for metrics, just not in a publicly exposed
way yet).

Regarding JVM overhead: that's a fair point about memory. The JVM does have a
cost. That said, memory / VMs are cheap these days, and that cost is fixed.
One thing to be careful of is that we often see people surprised to find
that they get a stray 100MB event going through their pipeline due to an
application bug. Having that extra memory is a good idea regardless. We have
many users increasing their heap size far beyond what the JVM requires simply
to handle weird bursts of jumbo logs.

~~~
whost49
Thanks for that information. There's no doubt Logstash can do a lot, and it
sounds like the multiple-pipeline feature will make it easier to do what we
wanted to do in a single process.

In the past, we've also been burned by so many Big Data solutions running out
of heap space that adding more processes that relied on tuning JVM parameters
again did not appeal to us.

------
shlant
So, as someone who is just about to implement Fluentd, what is the status of
using LogZoom with Docker?

Currently, with Fluentd all I have to do is set the log driver and tags in
DOCKER_OPTS, point Fluentd to ES, and I have all my container logs.
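
For comparison, that Fluentd setup amounts to a couple of flags, roughly
(address and tag are illustrative):

    # In DOCKER_OPTS, or per container on docker run:
    docker run --log-driver=fluentd \
        --log-opt fluentd-address=localhost:24224 \
        --log-opt tag="docker.{{.Name}}" \
        my-image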

Does LogZoom work this seamlessly with Docker? I know that at the very least I
will need
[https://github.com/docker/docker/issues/20363](https://github.com/docker/docker/issues/20363)
in order to implement any LogZoom plugin, so is this really much of a benefit
if I don't have hundreds of containers running on a host? The only concern I
had after reading this was whether Fluentd would use as many resources as they
mention. For my use case, I think not.

~~~
whost49
For your use case, I think Fluentd may work fine. LogZoom currently deals with
structured JSON log data received from hundreds of hosts around the world. It
could be modified to handle arbitrary logs (and wrap a structure around them)
and integrate with Docker, but that was not the goal here.

------
otterley
I'm a bit concerned that you're relying on Redis as an MQ for buffering. Redis
is an in-memory store with optional persistence, but having persistence
doesn't make it a log-structured system like Kafka. You still have to make
sure you don't run out of memory. This greatly limits its ability to buffer
messages.
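
In other words, the buffer ceiling is whatever you give Redis in redis.conf,
and once it is hit you either evict data or block producers (values
illustrative):

    maxmemory 2gb
    # Reject writes when full rather than silently evicting buffered logs
    maxmemory-policy noeviction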

It would have been much better IMHO to utilize an on-disk buffer instead, like
syslog-ng PE does.

------
chuhnk
I am the author of Logslam. Thanks to LogZoom for the credit.

