
Distributed Logging Architecture in the Container Era - hkmurakami
https://blog.treasuredata.com/blog/2016/08/03/distributed-logging-architecture-in-the-container-era/
======
rdli
I'm surprised the article didn't mention anything about the nuances of actual
logging in a distributed architecture. In particular, when a request starts to
span multiple containers or microservices, you need a way to correlate all log
data to a given request. You typically do this by propagating a trace ID
through your logging system, and a vector clock-type thing for causal
ordering. If you can't tell which log statements belong to which request and
the order in which they occur, debugging is quite hard :-).
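
For the trace-ID half of that, the stdlib plumbing is small. A minimal sketch (assuming Python 3.7+ and illustrative names, not any particular framework): a logging filter stamps every record with the current request's trace ID, so all of a request's lines can be pulled back together later.

    import contextvars
    import logging
    import uuid

    # Holds the trace ID for the request currently being handled.
    trace_id_var = contextvars.ContextVar("trace_id", default="-")

    class TraceIdFilter(logging.Filter):
        """Attach the current trace ID to every log record."""
        def filter(self, record):
            record.trace_id = trace_id_var.get()
            return True

    handler = logging.StreamHandler()
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(logging.Formatter("%(asctime)s trace=%(trace_id)s %(name)s %(message)s"))
    logging.basicConfig(level=logging.INFO, handlers=[handler])

    def handle_request(payload):
        # Set once at the edge; every downstream log line carries the same ID.
        trace_id_var.set(str(uuid.uuid4()))
        logging.getLogger("orders").info("received %s", payload)
        logging.getLogger("billing").info("charged card")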

~~~
manyxcxi
I've just started down this path for us and have been pretty happy with Loggly so far.

Just by adding a UUID to requests as they're initially received, I'm able to
follow the chain of events across five different services and a message queue,
and reconcile it all with any exception reports from Bugsnag.

We don't have a lot of unique clients, but we do have a lot of requests.

Loggly can be set to parse logs (or have the data sent directly to it) and
then tag messages with keys/values that can let you watch your request through
every server.

~~~
organsnyder
How are you passing that UUID along the request chain? For instance, when one
microservice calls another, does it have to be manually passed in?

~~~
eddieroger
My web stack checks for a specific header at each potential level of entry,
and in the absence of a value, generates one. Then each layer's logger will
log it back out. It's pretty straightforward for web requests - just use the
plumbing that's already there, headers.
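
Something like the following, roughly (a WSGI sketch with made-up names; the X-Request-Id header is a common convention, not a standard): reuse the incoming ID if an upstream service already set one, otherwise mint it at this entry point, and echo it back out.

    import uuid

    class RequestIdMiddleware:
        """Reuse or generate a correlation ID, then pass it downstream."""
        def __init__(self, app, environ_key="HTTP_X_REQUEST_ID"):
            self.app = app
            self.environ_key = environ_key

        def __call__(self, environ, start_response):
            rid = environ.get(self.environ_key) or str(uuid.uuid4())
            environ[self.environ_key] = rid  # downstream handlers log and forward this

            def start_response_with_id(status, headers, exc_info=None):
                headers.append(("X-Request-Id", rid))  # log it back out on the response
                return start_response(status, headers, exc_info)

            return self.app(environ, start_response_with_id)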

------
im_down_w_otp
The first paragraph closes with a series of incorrect assertions contrasting
containers with VMs. In reality, "distributed logging architecture in the
container era" is nearly identical to the distributed logging architecture of
the VM era, but according to the article it's a completely different thing and
mindset altogether... while looking basically identical.

~~~
virtualpaul
VMs last much longer than containers, no? I can't remember the last time I
spun up a VM that lived for less than a few seconds...

~~~
im_down_w_otp
Mine live for milliseconds, but I'm a weird case where that's intentional as a
security construct.

But that doesn't really have anything to do with the logging architecture. How
long the thing lives doesn't matter when you're using aggregation, bypassing
the local filesystem, and inspecting through a central portal.

People have been doing this in VM environments for a lot longer than they've
been using containers. Whole companies and services were born to facilitate
this, and some have even died already, in the time before containers hit full
stride in the hype cycle.

~~~
scaryclam
Yes, I was left scratching my head as to why containers play any significant
role in the issue of distributed logging. If your choice of container
runtime/VMs/servers is creating an issue with logging, then there's most likely
an issue with the system architecture rather than with the logging itself.

We've been building distributed systems (with containers and other approaches)
for years and this is the first time I've heard of logging being a problem.

------
voiper1
This is going to sound dumb, but I started dumping all my logs into MySQL so
that I wouldn't also need to run an Elasticsearch cluster. I then query by
error level, latency, and status code to find items of interest. Everything is
stored with some sort of request ID so that I can correlate multi-entry logs.

It's not pretty -- there are no real-time alerts, but I'm not paying dollars per
GB to AWS and I have no clue what Google will end up charging... and I'm
already running MySQL. So there is that...
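
For the curious, the shape of this is something like the sketch below (the PyMySQL driver and the column layout are my assumptions, not the setup described above):

    import json
    import uuid
    import pymysql

    conn = pymysql.connect(host="localhost", user="logs", password="...", database="logs")

    with conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS app_log (
                id BIGINT AUTO_INCREMENT PRIMARY KEY,
                ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                request_id CHAR(36),
                level VARCHAR(10),
                status_code SMALLINT,
                latency_ms INT,
                payload JSON)""")

        # One row per log event, keyed by request_id for correlation.
        cur.execute(
            "INSERT INTO app_log (request_id, level, status_code, latency_ms, payload) "
            "VALUES (%s, %s, %s, %s, %s)",
            (str(uuid.uuid4()), "ERROR", 500, 1240, json.dumps({"route": "/checkout"})))

        # "Items of interest": slow errors, correlated back into whole requests by ID.
        cur.execute(
            "SELECT request_id, ts, payload FROM app_log "
            "WHERE level = 'ERROR' AND latency_ms > 1000 ORDER BY ts DESC LIMIT 50")

    conn.commit()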

~~~
packetized
I'm guessing your log write volume is relatively low.

------
lobster_johnson
Fluentd is decent (though I rather wish Heka had taken off; it's much more
flexible and in theory leaner and faster since it's written in Go), but I'm
still looking for a good solution to aggregate, inspect, and search logs.

In particular, the ability to live-stream ("tail") logs seems to be a feature
generally missing from logging aggregators. There was one (Loggly?) which
provided a CLI tool for tailing, but it wasn't very impressive. (All the SaaS
apps I have looked at do quite poorly when it comes to rendering live logs in
a browser, too.)

Any recommendations?

~~~
ktamura
>In particular, the ability to live-stream ("tail") logs seems to be a feature
generally missing from logging aggregators.

If you are talking about tailing log files live, Fluentd has supported it from
Day 1:
[http://docs.fluentd.org/articles/in_tail](http://docs.fluentd.org/articles/in_tail)

Also, as other sibling comments mention, there are tools, both SaaS and open
source, that you can use as a destination of the logs Fluentd
tails/listens/collects.

* Elasticsearch: [https://www.digitalocean.com/community/tutorials/elasticsear...](https://www.digitalocean.com/community/tutorials/elasticsearch-fluentd-and-kibana-open-source-log-search-and-visualization)

* Graylog: [http://www.fluentd.org/guides/recipes/graylog2](http://www.fluentd.org/guides/recipes/graylog2)

* Scalyr: [https://github.com/scalyr/scalyr-fluentd](https://github.com/scalyr/scalyr-fluentd)

* Loggly: [https://www.loggly.com/blog/stream-filtering-loggly-fluentd/](https://www.loggly.com/blog/stream-filtering-loggly-fluentd/)

* SumoLogic: [https://gist.github.com/d-smith/8d3e7d53db772c6a7845](https://gist.github.com/d-smith/8d3e7d53db772c6a7845)

* Papertrail: [https://github.com/docebo/fluent-plugin-remote-syslog](https://github.com/docebo/fluent-plugin-remote-syslog)

(and literally hundreds of others)

>though I rather wished Heka had taken off; it's much more flexible and in
theory leaner and faster since it's Go

Heka was a great project, and its drop-in binary approach (as opposed to
requiring a VM like Ruby's) was interesting, if not compelling, in certain
situations. That said, I never saw any benchmark that showed Heka was
materially faster than Fluentd (or Logstash, for that matter). A lot of the
speed in this type of complex software comes from data structures/algorithms,
an appropriate use of low-level language bindings, etc.

While language plays a role in the speed of software, it's hardly the only
factor. As you said, it's only _in theory_, not in practice =)

~~~
lobster_johnson
I meant tailing the aggregated logs themselves. I have looked at the CLI tools
that exist for some of those SaaS services, and I have not found anything half
decent.

For example, you will definitely want to filter on labels (including regexps)
while tailing, and such filtering should support adjustable context (both # of
lines and time interval) as well as an optional time range to scroll back into
history ("tail from 2pm"). And of course, grepping of historical entries.

Another thing I am not impressed with is pricing. Loggly seems the most
reasonable in terms of price per volume, except it limits the number of "team
members" to a ridiculous degree (5 users or something like that).

I set up Graylog once and wasn't impressed with it. Its reliance on
Elasticsearch means it is quite static when it comes to the schema/input
format. You can't change the settings; there is no reindexing support (or at
least that was the case when I tried it, a year ago or so). Also, I don't think
it has any CLI tools?

------
kozikow
I recently plugged the Python logger from
[https://github.com/GoogleCloudPlatform/gcloud-python#google-stackdriver-logging](https://github.com/GoogleCloudPlatform/gcloud-python#google-stackdriver-logging)
into my Python codebase. Configuration took 15 minutes, I have a nice dashboard
showing recent logs, and I can log either strings or JSON. I can analyze the
data later using BigQuery. It works from Docker containers.
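
For reference, the wiring is roughly this (a sketch against the google-cloud-logging client; exact module and method names have shifted between the gcloud-python package linked above and later releases):

    import logging
    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client()
    client.setup_logging()  # attach a Cloud Logging handler to the root logger

    logging.info("plain string log line")  # shows up in the Stackdriver dashboard

    # Structured entries can go through a named logger as a dict payload.
    client.logger("checkout").log_struct({"status": 500, "latency_ms": 1240})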

I was contemplating trying out Logstash or Fluentd, but I don't see any major
advantage over my current simple solution. Can anyone more knowledgeable than
me explain what I would get from Logstash/Fluentd?

~~~
pfranz
Do any of your Python processes open a subprocess that then spawns another
Python process as a child (with both getting logged)? If so, have you found any
good way to associate the parent and child logging messages?

If you just write to stdout, everything makes sense in the file from top to
bottom, but when logging JSON objects into a database I had problems: if your
parent process echoes stdout from the child, you lose a lot of context about
the child; you can also double-log child-process messages (because the parent
is echoing them too); and you can't easily associate the parent/child objects.
I came up with a hacky solution, but I wasn't logging these events to a
centralized server at the time.

(I haven't used gcloud's dashboard, so I might be making wrong assumptions)

To respond to your question: it sounds like you have something that works fine
for logging messages. Personally, I split things into two categories: 1) logs
(serial events that need context to be useful) and 2) events (metrics or
exceptions). I wrote 1) to traditional log files and 2) to an Elasticsearch or
statsd database. Ideally, I wanted to use the same mechanism for both and peel
off relevant data into separate databases.

A metrics database like Elasticsearch will let you query things like, "What
modules give the most errors?" "Has this function been called more often this
week than last?" "Is this process taking longer when using the newly released
version compared to the old one?" etc.

~~~
derefr
Use the process group ID?
([http://unix.stackexchange.com/a/139230](http://unix.stackexchange.com/a/139230))
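
A sketch of that idea with the stdlib logger (Unix only; the field names are just examples): stamp every record with the process group ID, which a child started via plain fork/subprocess inherits, so parent and child lines share a correlating key while the pid still tells them apart.

    import logging
    import os

    class PgidFilter(logging.Filter):
        def filter(self, record):
            record.pgid = os.getpgrp()  # shared by the parent and its children
            return True

    handler = logging.StreamHandler()
    handler.addFilter(PgidFilter())
    handler.setFormatter(logging.Formatter("%(asctime)s pgid=%(pgid)s pid=%(process)d %(message)s"))
    logging.basicConfig(level=logging.INFO, handlers=[handler])

    logging.info("correlates with any subprocess that inherits this process group")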

------
r4um
The inbuilt Docker + fluentd driver + plain ES setup falls flat on its face at
scale. Not to mention the Docker daemon keeps dying :(.

What's working for us now is local Docker JSON logs -> Heka -> Kafka -> Graylog
-> ES. We even dockerized Graylog to scale the processing up and down on
demand.

It does 250k messages/sec at peak easily. More details in this presentation:
[https://www.youtube.com/watch?v=PB8dBnpaP8s](https://www.youtube.com/watch?v=PB8dBnpaP8s)

~~~
vardump
> It does 250k messages/sec at peak easily. More details in this presentation:
> [https://www.youtube.com/watch?v=PB8dBnpaP8s](https://www.youtube.com/watch?v=PB8dBnpaP8s)

250k messages/sec is pretty low.

It'd be useful (not to mention entirely feasible) to be able to handle 5M/sec
bursts on a single system and aggregate sustained 50M/s+ (10Gbps network).

The system should be able to establish total order (serialization) locally and
causal order over the network.

I did some tests and found that the hardware limits are somewhere between
20-50M messages per second (serialized) on current consumer-grade x86. A
practical implementation would of course be slower.

Of course you need to go to binary logging at that point, adhere to cache-line
boundaries, and so on (mechanical sympathy). Maybe even do usermode networking
at the aggregator server.

Binary logging is a must, because even something like string formatting is
simply way too slow, by an order of magnitude.

In my quick tests I found the string-processing hit to be surprisingly high,
10-50x. C's sprintf and C++'s stringstream are atrociously slow. (Surprisingly,
considering sprintf has a pretty complicated "bytecode"-style format-specifier
parsing loop, it was still significantly faster than the stringstream
implementation.)
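
To make "binary logging" concrete: the hot path just packs fixed-width fields and defers all formatting to an offline reader. A toy sketch in Python (the record layout is invented, and a real implementation at the rates discussed here would be in a systems language):

    import struct
    import time

    # little-endian: u64 timestamp, u32 event id, u16 level, u16 extra-length
    RECORD = struct.Struct("<QIHH")

    def log_binary(buf, event_id, level, extra=b""):
        # No string formatting here: just 16 bytes of packed fields plus payload.
        buf.extend(RECORD.pack(time.monotonic_ns(), event_id, level, len(extra)))
        buf.extend(extra)

    buf = bytearray()
    log_binary(buf, event_id=42, level=3)

    # A reader decodes and formats offline, where the cost doesn't matter.
    ts, event_id, level, extra_len = RECORD.unpack_from(buf, 0)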

Timestamps are another huge performance issue. Typical system calls for a
high-precision (a few microseconds or better) timestamp take several
_microseconds_ each; using one of those alone will drop performance to the
100-500k range. It's of course much faster to use RDTSC, but then you have the
issue that different CPU sockets have different offsets (sometimes large) and
possibly some frequency difference between them. Also, older CPUs don't have an
invariant TSC, so their tick rate changes when the CPU frequency changes.

(Some words of warning about Windows QueryPerformanceCounter: when you test it
on your development laptop, it appears fast, because it's using RDTSC behind
the scenes. But when it runs on a NUMA server, it often changes behavior and
becomes 20-100x slower, because Windows starts to use HPET instead. RDTSC
takes maybe ~10 ns to execute and it's not a shared resource. HPET takes 1-2
microseconds to read and is a _shared system resource_; concurrent access from
multiple cores will make it slower.)

~~~
r4um
That is the peak we see right now, at ~20-30% overall network/CPU utilization;
it can do much more easily.

> Binary logging is a must, because even something like string formatting is
> simply way too slow, by an order of magnitude.

True, string processing takes most of the resources.

~~~
vardump
Timestamps are even worse. They're slow to get and have a multitude of other
problems, like timezones, system clock adjustments (manual or NTP), rate
drift, offset errors, etc.

Some solve it by caching a low-res timestamp, but that means losing the ability
to get side-channel information about precisely how much time passed between
two events. High-resolution (micro/nanosecond range) timestamps give _so_ much
more information than just the time.

~~~
r4um
We haven't dealt with those yet. But in general we are always pushing everyone
to log judiciously, use metrics wherever possible (numbers are still easier to
manage than strings), use Sentry for exception logging, and we're introducing
Zipkin for tracing.

------
tveita
The Docker logging model of line-by-line stdout/stderr doesn't seem very
useful beyond debugging in the local console. I prefer configuring the
application logger to ship logs directly to a proper log system when possible.

E.g. Logback is perfectly happy shipping structured logs directly to Kafka or
Elasticsearch, with no need to re-parse the formatted log output.
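
The same idea in Python rather than Logback, as a sketch (assumes the kafka-python package and an illustrative topic name): the handler emits structured JSON straight from the app, so nothing downstream has to re-parse formatted lines.

    import json
    import logging
    from kafka import KafkaProducer

    class KafkaJsonHandler(logging.Handler):
        """Ship each record to Kafka as structured JSON."""
        def __init__(self, bootstrap_servers, topic):
            super().__init__()
            self.topic = topic
            self.producer = KafkaProducer(
                bootstrap_servers=bootstrap_servers,
                value_serializer=lambda d: json.dumps(d).encode("utf-8"))

        def emit(self, record):
            self.producer.send(self.topic, {
                "ts": record.created,
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            })

    logging.getLogger().addHandler(KafkaJsonHandler("kafka:9092", "app-logs"))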

~~~
jacques_chester
> _The Docker logging model of line-by-line stdout/stderr doesn't seem very
> useful beyond debugging in the local console._

It's a well-understood pattern that Heroku made more visible under "The
Twelve-Factor App".

stdout/stderr require no configuration in any language that I'm aware of. Why
should the app care about how to wire up logging? That's a platform concern.

On Cloud Foundry you get Loggregator, which frankly needs improvement, but for
the most part you don't care about how to wire up logging. You print to stdout
or stderr and the platform wicks that away to a firehose service for you. You
can hook up Kafka to the spout, or Elasticsearch, or anything else you like.

Disclosure: I work at Pivotal; we donate the majority of the engineering to
Cloud Foundry.

------
sheraz
I use Graylog, and Docker can ship logs using the GELF format.

Furthermore, I've got my Graylog setup running from a customized
docker-compose.yml file, and this thing sings.

------
bluecmd
They say that without aggregation, scale-out is impossible. That's simply not
true. Using things like anycast and ECMP it's super easy, as log destinations
are usually stateless.

Also, I've never had network contention because I logged too much unless
something was really broken - and then only at that node.

I'll continue sending events to my anycast address without any aggregation
thank you very much.

~~~
sagichmal

> They say that without aggregation scale-out is
> impossible. That's simply not true. Using things like
> anycast and ECMP it's super easy . . .

Then you haven't worked at significant scale. That's fine! Just keep
everything in context :)

~~~
bluecmd
A "probably" there would be nice. I don't want to have to defend myself, but
I'm pretty confident about the scale I've worked on.

I guess it boils down to what you call aggregation.

Is having multiple stateless receivers behind anycast/ECMP writing to a
distributed database aggregation? I'd argue not, but maybe this is where the
difference in opinion lies.

------
ivan_ah
How does Fluentd compare with Logstash, the other logging Swiss Army knife?

~~~
ktamura
There have been some comparisons in the past:

[http://logz.io/blog/fluentd-logstash/](http://logz.io/blog/fluentd-logstash/)

[https://www.pandastrike.com/posts/20150807-fluentd-vs-logstash](https://www.pandastrike.com/posts/20150807-fluentd-vs-logstash)

[http://jasonwilder.com/blog/2013/11/19/fluentd-vs-logstash/](http://jasonwilder.com/blog/2013/11/19/fluentd-vs-logstash/)

------
SixSigma
> But unlike syslogd, Fluentd was built from the ground up to unify log
> sources from microservices, so they can be used efficiently for production
> and analytics.

From syslogd(8):

    -r  This option will enable the facility to receive message from the
        network using an internet domain socket with the syslog service
        (see services(5)).

Why won't this work with dynamic DNS?
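
For comparison, pointing an app at a central syslogd started with -r takes nothing beyond the standard library (the host name below is a placeholder):

    import logging
    import logging.handlers

    # Send to a remote syslogd listening on UDP 514 (started with -r).
    syslog = logging.handlers.SysLogHandler(address=("loghost.example.com", 514))
    syslog.setFormatter(logging.Formatter("myapp: %(levelname)s %(message)s"))

    root = logging.getLogger()
    root.addHandler(syslog)
    root.setLevel(logging.INFO)

    logging.info("hello from a container")  # lands in the central syslog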

