
A log/event processing pipeline you can't have (2019) - sigil
https://apenwarr.ca/log/20190216
======
fluential
Great article based on real life experience. I have been building logging and
protective monitoring pipelines for a while now.

From my experience, when it comes to log shipping from hosts, rsyslog + RELP +
disk-assisted in-memory asynchronous queues are preferred; most of the time you
only have network I/O, as logs never touch the local disk.

The idea is to ship logs off the device ASAP, with the destination acting as a
sink server capable of handling most spikes without stressing the local
source. All of this is done via rsyslog, which also wraps the actual logs into
JSON format locally. The glue could be the syslog tag.
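
A minimal sketch of that kind of rsyslog setup (RainerScript syntax; the sink
hostname, port, and JSON field names are placeholders, not taken from any real
deployment):

    module(load="omrelp")

    # Wrap each log line into JSON locally before shipping.
    template(name="json_lines" type="list") {
        constant(value="{\"ts\":\"")
        property(name="timereported" dateFormat="rfc3339")
        constant(value="\",\"host\":\"")
        property(name="hostname" format="json")
        constant(value="\",\"tag\":\"")
        property(name="syslogtag" format="json")
        constant(value="\",\"msg\":\"")
        property(name="msg" format="json")
        constant(value="\"}")
    }

    # Disk-assisted in-memory queue: stays in RAM while the sink
    # keeps up and spills to disk only during spikes or outages.
    action(type="omrelp"
           target="logsink.example.com"
           port="2514"
           template="json_lines"
           queue.type="LinkedList"
           queue.filename="relp_fwd"
           queue.maxDiskSpace="1g"
           queue.saveOnShutdown="on"
           action.resumeRetryCount="-1")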

At the other end you could have an ELK stack, with Logstash using the
json_lines input codec (pretty fast) to structure the data further to your
liking.
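
On the receiving side, the Logstash input can be as small as this (the port is
a placeholder):

    input {
      tcp {
        port  => 5140
        codec => json_lines
      }
    }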

Looking at metrics just now, the average time for logs to show up in ELK is
7-200 ms (the latency comes mostly from specific reads happening against the
ES cluster).

As ELK is always the slowest component, dropping logs compressed in memory
directly onto disk is also an option.

One thing to note is that RELP can produce extra duplicates, which are easily
handled by inserting into Elasticsearch using a specific document ID (at some
performance penalty), e.g. a unique hash computed over (log content,
timestamp, host). With this in place you can also easily "replay" a stream of
logs to fill potential gaps.
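
As a sketch of the dedup idea (the field choice and hash are illustrative, not
necessarily what the commenter actually runs):

    import hashlib

    def log_doc_id(message: str, timestamp: str, host: str) -> str:
        # Deterministic ID: a RELP re-delivery of the same event maps
        # onto the same Elasticsearch _id, so the second insert is an
        # idempotent overwrite rather than a duplicate document.
        key = "\x00".join((timestamp, host, message)).encode("utf-8")
        return hashlib.sha256(key).hexdigest()

    # Index with PUT /logs/_doc/<log_doc_id(...)> instead of POST, and
    # replaying a stream becomes a safe, gap-filling operation.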

This type of setup scales really well, too.

Edit: typos

~~~
cbsmith
Except you can scale up and beat that performance significantly if you just
use the standard syslog wire protocol to a central log server. That avoids a
lot of independent append operations that you're counting on staying in the OS
buffer despite whatever other I/O might be going on, and you're also burning a
lot of CPU (and space) transforming things to JSON on an inefficient JRuby
runtime.

We make this stuff harder by rebuilding functionality on top of systems that
already have the functionality.

~~~
foota
That works fine until your centralized server falls over, either because
you're scaling to hundreds or thousands of machines with chatty logs, or
because someone goofs and starts logging kilobytes of data on every request.

~~~
cbsmith
If you care about that, it's very easy to have redundancy there. In fact, the
target pipe you route to on each container/system/etc. can be fully
virtualized and routed as needed by... syslog. ;-) If you want to go crazy you
can wrap it in rate limiting which is done for you by... syslog.
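
For instance, rsyslog does it with a couple of parameters on the imuxsock
module (the numbers here are arbitrary placeholders):

    # Allow bursts of up to 1000 messages per 5-second window on the
    # local log socket; anything beyond that gets dropped.
    module(load="imuxsock"
           sysSock.rateLimit.interval="5"
           sysSock.rateLimit.burst="1000")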

------
user5994461
Good read, except for the part where the author says there are no existing
solutions for processing logs. There are quite a few robust, scalable ones.

syslog-ng, logstash, or fluentd on the host to collect and aggregate logs
(logstash/fluentd can parse text messages with regexes and handle a hundred
different things like s3/kafka/http, but they are much more
resource-intensive).

kibana or graylog to centralize and search logs; the storage is elasticsearch.

A simple syslog-ng on the devices could probably do the job. A little-known
fact about syslog: it can reliably forward messages over TCP, messages are
numbered and retried, and syslog-ng can do DNS load balancing.
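
A rough sketch of that in syslog-ng (the hostname and buffer sizes are
placeholders, and s_local stands for whatever your local source is named):

    # Reliable TCP forwarding with a disk buffer to ride out
    # network hiccups and receiver restarts.
    destination d_central {
        network("logs.example.com"
            transport("tcp")
            port(514)
            disk-buffer(
                reliable(yes)
                mem-buf-size(16777216)    # 16 MiB
                disk-buf-size(268435456)  # 256 MiB
            )
        );
    };
    log { source(s_local); destination(d_central); };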

~~~
d4rti
Elasticsearch would not cope with that volume of data on the hardware
described.

~~~
user5994461
Well, the article doesn't mention the hardware, only one machine for 5 TB a
day, which doesn't add up as far as I'm concerned.

If they were storing everything in S3, it's possible to do something similar
with Elasticsearch for the same order of cost (maybe three times?), with the
money going to EBS storage instead. Elasticsearch lets you query and visualize
logs, which plain S3 storage doesn't; that's worth a bit more IMO.

~~~
Spivak
I think the author's math tracks. Our pipeline looks something like

hosts -> rsyslog collector -> kafka -> custom crunching -> elastic

which is a bit inefficient, and we try to go hosts -> kafka when we can, but a
lot of stuff only supports rsyslog, so it's there and it's simple enough.
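
For hosts that can reach Kafka directly, one way to collapse the extra hop
while still speaking syslog locally is rsyslog's omkafka output (the broker
names and topic are made up):

    module(load="omkafka")
    action(type="omkafka"
           topic="logs"
           broker=["kafka1:9092", "kafka2:9092"]
           queue.type="LinkedList"
           queue.filename="kafka_fwd"
           action.resumeRetryCount="-1")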

The rsyslog collector and our crunchers are teeny-tiny compared to the rest of
the pipeline and can chew through up to a week of backlog in a few hours. The
bottleneck for us is the network; if we upped the pipe to 10G we could
probably get away with a single host.

------
ezekiel68
Loved this. Great, actionable advice that's still applicable over a year later
-- and a true geek's sense of humor. The hidden contrarian in all of us cheers
along with his trials and triumphs. I didn't mind the soft-sell final
paragraph at all, since he gave away the keys to the kingdom in the rest of
the article anyway.

~~~
RasmusWL
I was excited to look at the company, but it turns out they pivoted away from
doing log processing :(

~~~
yorwba
It appears to be part of their offering, but as an implementation detail, not
a standalone service: https://tailscale.com/kb/1011/log-mesh-traffic

~~~
RasmusWL
Awesome, thanks for digging that up :)

------
winrid
Fun read, nice to see some "down to Earth" engineering. :)

------
gbrown_
> So, the pages are still around when the system reboots.

...

> The kernel notices that a previous dmesg buffer is already in that spot in
> RAM (because of a valid signature or checksum or whatever) and decides to
> append to that buffer instead of starting fresh.

This sounds like it should be very unreliable. Perhaps it works in practice
but I couldn't see myself relying on such a mechanism.

~~~
detaro
Why? In a software-triggered warm reset you wouldn't lose data to lack of
refresh, so it doesn't seem that different from recovering from an interrupted
on-disk log or whatever.
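
FWIW, mainline Linux ships essentially this mechanism as pstore/ramoops: a
signature-checked RAM region that survives warm reboots and shows up under
/sys/fs/pstore after boot. A sketch of enabling it from the kernel command
line (the address and size are platform-specific placeholders):

    ramoops.mem_address=0x8000000 ramoops.mem_size=0x100000 ramoops.ecc=1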

------
traceroute66
It was all making for such an interesting read until the last paragraph.

I don't know about anyone else, but I have this inherent hatred of company
marketing material disguised as blog posts.

If you are going to write a decent blog post, then write a decent blog post.
If people are curious about the author they can look them up (and their
affiliation). Don't turn it into a sales pitch.

~~~
MaulingMonkey
While I share the annoyance at stealth sales pitches:

1. it appears to have been added two months after the initial posting, rather
than as a cynical cash grab.

2. it appears the resulting company has pivoted to VPN stuff and is no longer
in the log/event processing business anyway?

~~~
tmd83
That's what I was wondering. Tailscale doesn't do logs even though it started
with that :(. I would have loved to see their logging solution.

I like the writeup, though I need to do a second pass to fully review
everything. The biggest thing that suddenly hit me was that 5 TB/day is really
only about 60 MB/s.
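
Quick sanity check of that figure, using decimal units:

    5 TB/day = 5e12 bytes / 86,400 s ≈ 58 MB/s ≈ 0.46 Gbit/s

which is why the article's one-beefy-machine approach isn't as crazy as it
first sounds.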

