A log/event processing pipeline you can't have (2019) (apenwarr.ca)
162 points by sigil on Aug 26, 2020 | 40 comments



Great article based on real-life experience. I have been building logging and protective monitoring pipelines for a while now.

In my experience, when it comes to log shipping from hosts, rsyslog + RELP + disk-assisted in-memory asynchronous queues are preferred; most of the time you only have network I/O, as logs don't touch the local disk.

The idea is to ship logs off the device ASAP, with the destination acting as a sink server capable of handling most spikes without stressing the local source. All of this is done via rsyslog, which also wraps the actual logs into JSON locally. The glue can be the syslog tag.
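
A minimal client-side sketch of what that can look like (the target host, port, and template name are placeholders, not anything from an actual setup):

    # load the RELP output module
    module(load="omrelp")

    # wrap each message as a JSON line locally
    template(name="json_lines" type="list") {
      constant(value="{\"host\":\"")        property(name="hostname" format="json")
      constant(value="\",\"timestamp\":\"") property(name="timereported" dateFormat="rfc3339")
      constant(value="\",\"tag\":\"")       property(name="syslogtag" format="json")
      constant(value="\",\"message\":\"")   property(name="msg" format="json")
      constant(value="\"}")
    }

    # ship over RELP with a disk-assisted in-memory queue: the queue lives in RAM,
    # spills to disk only on overflow (queue.filename enables this) or shutdown,
    # so logs normally never touch the local disk; retry forever if the sink is down
    action(type="omrelp" target="logsink.example.com" port="2514"
           template="json_lines"
           queue.type="LinkedList"
           queue.filename="relp_fwd"
           queue.maxDiskSpace="1g"
           queue.saveOnShutdown="on"
           action.resumeRetryCount="-1")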

At the other end you can have an ELK stack, with Logstash using the json_lines codec input (pretty fast) to structure the data further to your liking.

Looking at metrics right now, the average time for logs to show up in ELK is 7-200 ms (the latency comes mostly from specific reads happening against the ES cluster).

As ELK is always the slowest component, dropping logs, compressed in memory, directly onto disk is also an option.

One thing to note is that RELP can produce duplicates. These are easily handled by inserting into Elasticsearch with an explicit document ID (at some performance penalty), e.g. a unique hash computed over (log content, timestamp, host). With this in place you can also easily "replay" a stream of logs to fill potential gaps.
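
Roughly, the Logstash side of that can look like this sketch (the port, field names, index, and ES hosts are placeholders and depend on what your shipper actually emits):

    input {
      tcp {
        port  => 5140          # JSON-lines stream forwarded by the sink
        codec => json_lines
      }
    }

    filter {
      # deterministic ID: the same (content, timestamp, host) always hashes to the
      # same value, so RELP duplicates and replayed logs simply overwrite themselves
      fingerprint {
        source              => ["message", "timestamp", "host"]
        concatenate_sources => true
        method              => "SHA1"
        target              => "[@metadata][doc_id]"
      }
    }

    output {
      elasticsearch {
        hosts       => ["https://es.example.com:9200"]
        index       => "logs-%{+YYYY.MM.dd}"
        document_id => "%{[@metadata][doc_id]}"
      }
    }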

This type of setup scales really well, too.

Edit: typos


Except you can scale up and beat that performance significantly if you just use the standard syslog wire protocol to a central log server. That avoids a lot of independent append operations that you are counting on staying in the OS buffer despite whatever other I/O might be going on, and you're also burning a lot of CPU (and space) transforming things to JSON using an inefficient JRuby runtime.

We make this stuff harder by rebuilding functionality on top of systems that already have the functionality.


RELP is very much the standard syslog wire protocol (which is just plain TCP anyway), but with some extra machinery to avoid losing data. It just fixes the TCP issue where data can be ACKed into the OS receive buffer and the app can be killed before offloading it into persistent storage. Not losing data is very important in log shipping.

The JSON that rsyslog wraps the data in still travels over the same syslog protocol; nothing changes here, it's just encapsulated within it. The only reason to unwrap the JSON is when you need to start structuring your data for BI views; your mileage may vary. The key is that these two are decoupled.

Because you're working at the basic networking/storage layers, it's very easy to reason about and build reliable pipelines according to your needs. As you say, you can't beat that.

PS: The pain starts when you need to deal with multiline logs like Java stack traces, but then you just work with the devs to log everything as JSON.

Edit: split comment, more info


That works fine until your centralized server falls over, either because you're scaling to hundreds or thousands of machines and your logs are chatty or someone goofs and starts logging kilobytes of data on every request.


If you care about that, it's very easy to have redundancy there. In fact, the target pipe you route to on each container/system/etc. can be fully virtualized and routed as needed by... syslog. ;-) If you want to go crazy you can wrap it in rate limiting which is done for you by... syslog.


There would usually be multiple syslog sink servers, so there is never a single point of failure. This can be configured in various ways: DNS round robin, load balancers, or even multiple destinations configured on the client side; rsyslog can iterate over a list in case of failures.
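
The client-side variant is a simple failover chain using rsyslog's suspend/resume handling; a sketch with placeholder hostnames:

    # primary sink: rsyslog suspends this action automatically if it becomes unreachable
    action(type="omrelp" target="sink1.example.com" port="2514")

    # fallback: only runs while the previous action is suspended
    action(type="omrelp" target="sink2.example.com" port="2514"
           action.execOnlyWhenPreviousIsSuspended="on")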

If there is a log storm coming from a misconfigured app, then depending on your traffic levels the sink servers should give you plenty of headroom for an on-call engineer to be notified that something is off and to adjust / rate limit / drop accordingly.

Edit: updated with more info


We have a similar setup, but I've been generating UUIDs for each event at logging time to avoid any computational overhead for deduping.


The way to approach that in Logstash is to use the fingerprint plugin, which supports various formats (including UUIDs); MURMUR3 seems to be the fastest here.

https://www.elastic.co/guide/en/logstash/current/plugins-fil....

Edit: typo


Unless they are type 1 UUIDs, there's probably just as much computational overhead going on in the UUID creation. ;-)


Well, in our case, the event generators are almost never CPU bound but the logstash nodes are. Generating a UUID (which is mostly just waiting for the kernel to hand you some bytes, which allows other threads time to do useful work) before sending has no opportunity cost and frees up the more-valuable logstash resources.

(These are micro-optimizations anyway.)


Yeah... I'd definitely remove logstash from that pipeline for that reason.


Good read, except the part where the author says there are no existing solutions for processing logs. There are quite a few robust, scalable ones.

syslog-ng, Logstash, or Fluentd on the host to collect and aggregate logs (Logstash/Fluentd can parse text messages with regexes and handle a hundred different things like S3/Kafka/HTTP, but they are much more resource intensive).

Kibana or Graylog to centralize and search logs; the storage is Elasticsearch.

A simple syslog-ng on the devices could probably do the job. Little-known fact about syslog: it can reliably forward messages over TCP, logs are numbered and retried, and syslog-ng can do DNS load balancing.
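
For example, a syslog-ng sketch along these lines (destination host, buffer sizes, and source contents are placeholders) covers reliable TCP forwarding with a local disk buffer:

    source s_local { system(); internal(); };

    destination d_central {
      # 16 MB in-memory buffer with a 1 GB on-disk overflow; reliable(yes) means
      # messages are only lost if the disk buffer itself fills up
      syslog("logs.example.com"
             transport("tcp") port(601)
             disk-buffer(reliable(yes) mem-buf-size(16777216) disk-buf-size(1073741824))
      );
    };

    log { source(s_local); destination(d_central); };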


Yeah, fluentd... I had to look into its in_stdin plugin, and I can tell you one thing: it simply doesn't handle regexes properly. In fact, none of its input plugins can correctly handle the situation where a multiline regex doesn't match the whole line, unless they do what in_tail did (split the lines themselves and disallow multiline regexes).

Edit: Sorry, I meant Fluent Bit. No idea how Fluentd handles this scenario, but I was told it was too slow (being written in Ruby), which is why the switch to Fluent Bit was made.


Last time I tried, fluent-bit did not support IPv6. Maybe it does now? (Maybe it does not: https://github.com/fluent/fluent-bit/issues/2056 has been opened recently). Anyway, compared to the solution in the article, it seemed way too fanciful.

Fluentd would not just be slow, but also run out of memory. I am no ruby-head, but a then-colleague of mine helped configure it. It still ran out of memory. I had no patience for a log system that did not work out of the box on one computer basically just logging failed ssh login attempts.


I've done multiline processing with Logstash. It works, but it's not necessarily efficient or simple. It depends a lot on the pattern and whether the multiline stack traces you want to extract turn out to be 700 lines long sometimes (yes, speaking from experience).

I'd personally recommend not bothering to process multiline output into a single message. Lots of trouble for no benefit. It's just a stream of lines at the end of the day; it will look the same in tail and Kibana.
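
For anyone who does need it anyway, the usual tool is the multiline codec on the Logstash input; the port, pattern, and limits below are placeholders:

    input {
      tcp {
        port  => 5141
        codec => multiline {
          pattern   => "^\s"       # lines starting with whitespace (typical for stack traces)
          what      => "previous"  # ...get folded into the previous event
          max_lines => 500         # cap runaway stack traces
        }
      }
    }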


Check out Vector: vector.dev



The flexibility of Fluentd and the community around it are great, but ultimately the resource intensity proved to be too high and too unpredictable for our use case.

We ended up building our own general purpose log parser in Go to support our hosted log monitoring platform. Though it's still somewhat young, performance is great (beats Fluent Bit in most of our benchmarks), and it's nearly as flexible as Fluentd in terms of configuration. If you want to check it out, we recently open-sourced it and are always looking for more feedback: https://github.com/observIQ/stanza


There's also good ol' dirty syslog.


elasticsearch would not cope with that volume of data on the hardware described.


Well, there is ClickHouse, and projects like this: https://github.com/flant/loghouse


Well, the article doesn't mention the hardware. Only one machine for 5 TB a day, which doesn't add up as far as I am concerned.

If they were storing everything in S3, it's possible to do something similar with Elasticsearch for the same order of cost (maybe three times?), with the money going to EBS storage instead. Elasticsearch lets you query and visualize logs, which plain S3 storage doesn't; it's worth a bit more IMO.


I think the author's math tracks. Our pipeline looks something like

hosts -> rsyslog collector -> kafka -> custom crunching -> elastic

which is a bit inefficient; we try to go hosts -> kafka when we can, but a lot of stuff only supports rsyslog, so it's there and it's simple enough.

The rsyslog collector and our crunchers are teeny-tiny compared to the rest of the pipeline and can chew through up to a week of backlog in a few hours. The bottleneck for us is the network; if we upped the pipe to 10G we could probably get away with a single host.
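
The hosts -> kafka hop can also be done straight from rsyslog with omkafka; a sketch with placeholder brokers and an assumed JSON template defined elsewhere:

    module(load="omkafka")

    # confParam entries go straight to librdkafka; the named queue.filename
    # gives a disk-assisted queue for riding out broker outages and backlog
    action(type="omkafka"
           broker=["kafka1.example.com:9092", "kafka2.example.com:9092"]
           topic="logs"
           template="json_lines"
           confParam=["compression.codec=snappy"]
           queue.type="LinkedList"
           queue.filename="kafka_fwd")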


I missed where a volume of data was described that was outside Elasticsearch's capacity. It seems unlikely that it "would not cope" regardless of configuration.


Loved this. Great, actionable advice that's still applicable over a year later -- and a true geek's sense of humor. The hidden contrarian in all of us cheers along with his trials and triumphs. I didn't mind the soft-sell final paragraph at all, since he gave away the keys to the kingdom in the rest of the article anyway.


I was excited to look at the company, but it turns out they pivoted away from doing log processing :(


It appears to be part of their offering, but as an implementation detail, not a standalone service: https://tailscale.com/kb/1011/log-mesh-traffic


Awesome, thanks for digging that up :)


Fun read, nice to see some "down to Earth" engineering. :)


> So, the pages are still around when the system reboots.

...

> The kernel notices that a previous dmesg buffer is already in that spot in RAM (because of a valid signature or checksum or whatever) and decides to append to that buffer instead of starting fresh.

This sounds like it should be very unreliable. Perhaps it works in practice but I couldn't see myself relying on such a mechanism.


Why? In a software-triggered warm reset you wouldn't lose data to lack of refresh, so it doesn't seem that different from recovering from an interrupted on-disk log or whatever.


I've worked on a system that relied on it to do fast upgrade reboots and it worked. I'll have no problem relying on this again.


It was all making for such an interesting read until the last paragraph.

I don't know about anyone else, but I have this inherent hatred of company marketing material disguised as blog posts.

If you are going to write a decent blog post, then write a decent blog post. If people are curious about the author they can look them up (and their affiliation). Don't turn it into a sales pitch.


[Author here] As others have pointed out, it wasn't originally a sales pitch and we later pivoted to something else. I agree it came across in a kind of annoying way.

I've added a new update at the end to describe what happened, for anyone who is curious.


Thank you. Good luck with Tailscale! Very cool service.


He wrote the post on 2019-02-16 and didn't put in the edit until 2019-04-26 so the causality probably happened the other way around.

It seemed like a career summary from his time at Google Fiber (which was killed by Google).

It sucks when your baby gets killed for reasons outside of your control. You want those years of your life to mean something, for the work to live on in some form (hence the "Please, please, steal these ideas").

He probably realized after publishing that the post wasn't enough to give him closure so he started a company to keep working on it instead.

(I don't know why I am psychoanalyzing a random blog. I am probably projecting really hard.)


Fiber was killed? Google won’t stop hassling us about getting it here in Austin, TX!


While I share the annoyance at stealth sales pitches:

1. it appears to have been added 2 months after the initial posting, rather than being a cynical cash grab.

2. it appears the resulting company has pivoted to VPN stuff and is no longer in the log/event processing business anyway?


That's what I was wondering. Tailscale doesn't do logs, even though it started with that :(. I would have loved to see their logging solution.

I like the writeup, though I need to do a second pass to fully review everything. The biggest thing that suddenly hit me was that 5 TB/day is really only about 60 MB/s.
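
(Sanity check: 5 TB/day is about 5 × 10^12 bytes / 86,400 s ≈ 58 MB/s, i.e. roughly 460 Mbit/s sustained.)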


The company's mission statement is "Simplifying the long tail of software development". Perhaps networking is only one of possibly many building blocks they plan to tackle to simplify software development for normal companies.



