
Eliminating Delays from systemd-journald, Part 1 - philips
https://coreos.com/blog/eliminating-journald-delays-part-1.html
======
falcolas
> In more extreme scenarios, if journald is blocked long enough for its
> watchdog timer to expire, it will be killed with SIGABRT, dump core, and
> abruptly close all connected sockets losing whatever log data was in-flight.
> In some cases this can even cause services to crash due to the unexpected
> hangup of their stdout and/or stderr.

Ouch.

> we were seeing a delay between .1-.5 seconds twice every 5 seconds

Double ouch. Half a second can be an eternity when you're wringing all the
performance you can out of a system, and with so many cloud providers using
network-attached storage, that half second can balloon quickly.

200KB is not an unusual size for unix socket buffers, and it wouldn't be too
hard to exceed 200KB in log writes in half a second on a busy system.

As an org that will be moving to systemd with the release of the next Ubuntu
LTS version, I appreciate what CoreOS is doing here. Great article, great
finds.

As a side note, this is something to look out for on AWS instances where the
default for /var/log is on an EBS mount (which can see delays due to network
latency). It would make sense to mount /var/log to the ephemeral drive to
avoid or minimize such unpleasantness. Of course, that limits the ability to
use /var/log to troubleshoot kernel panics.

~~~
otterley
Durability versus performance is a classic tradeoff, and systemd-journald is
hardly the only application affected by it. If you need more performance
while maintaining durability, buy faster storage technology.

That said, if journald could use multiple I/O threads, it might be able to
take advantage of concurrency in the underlying storage if available (e.g.
RAID). MySQL's InnoDB is a good example of such an implementation.

At any rate, in a modern production system, you generally want logs to be
copied to an indexed archive located elsewhere in the network. On-host logs
should be considered ephemeral, and mainly used as a buffer against slow
consumers.

~~~
falcolas
I think the bigger problem is how outside services are being negatively
impacted (up to and including crashing) by delays in writing to disk, and
have no say in the performance/safety tradeoff.

~~~
otterley
Any program that uses the syslog(3) library function will also have this
problem, as it also blocks (internally it writes to the /dev/log socket file).
If the consumer can't keep up, whether it be rsyslogd or syslog-ng or whatever
is listening on the socket, the producer can block as well.

So this is not a new problem. If you really want your producer to drop logs in
the event of backpressure, you can directly use the syslog protocol over UDP,
but it has tradeoffs too.
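
Roughly, in C (an untested sketch; the collector address and the message
are made up), the fire-and-forget UDP version looks like:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0)
            return 1;

        struct sockaddr_in dst = {0};
        dst.sin_family = AF_INET;
        dst.sin_port = htons(514);                        /* standard syslog port */
        inet_pton(AF_INET, "192.0.2.10", &dst.sin_addr);  /* example collector */

        /* PRI <14> = facility user (1) * 8 + severity info (6) */
        const char *msg = "<14>myapp: hello from a UDP producer";

        /* The kernel queues the datagram and returns; it never waits for
         * the consumer to read it, so a slow collector can't block us.
         * The price is that overflow means silent drops. */
        sendto(fd, msg, strlen(msg), 0, (struct sockaddr *)&dst, sizeof(dst));

        close(fd);
        return 0;
    }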

~~~
falcolas
That depends on the syslogd implementation, and will not cause programs to
crash due to disconnected stderr/stdout buffers.

Perhaps more importantly, you can swap out the syslogd if you don't like its
implementation. This is harder to do (if even possible) with journald.

~~~
otterley
Have you seen any syslog server implementations that are significantly more
performant than journald?

------
journaldmann
Great discussion, but one issue I've run into with journald, compared to a
traditional logging mechanism, is the size of the journal itself.

journald's archives are generally at least 200x the size of syslog's for the
same amount of information stored.

That is, on two identical servers, I've run rsyslog with the traditional
logrotate mechanism and journald side-by-side to see how they compared.

I'm pleased with the extra querying mechanism provided by journald, which is
absolutely great for debugging. However, while I can store the logs on one
server virtually forever with just bz2 compression, I've unexpectedly run out
of space in my /var partition with journald after just a couple of months. The
extra bandwidth is also a factor in journald's performance, and it is equally
overlooked.

In fact, this has forced me to reconsider my log storage policy. It's much
more effective to use rsyslog to send encrypted logs over the network (faster,
with less latency and less bandwidth) than it would be with journald. I still
keep the journal on individual machines for easier querying, but with a very
limited retention.

If I have to choose between a log I can query easily and a log file with 200x
the density, I'll easily pick the second.

I've also found that the resilience of journald's archives is much lower. A
corrupted text log file (due to a crash) needs no recovery step and can still
be read to inspect issues in any worst-case scenario. journald archives,
however, are often corrupted in crashes and moved aside after the reboot. For
remote logging it might not be an issue, but for embedded systems and/or
desktop machines it _is_ an issue. Ensuring logs can be read even when
corrupted is easily much more important than the signing mechanism introduced
in the archives (which, on the other hand, is totally a non-issue, as
sensitive hosts should do remote logging).

~~~
ape4
I can't help but think... if they just used traditional log files, there would
be no delays or corruption.

~~~
pilif
I don't think this is entirely true: Even if you're appending lines to a text-
file, if you do it from multiple concurrent threads, you're going to
potentially corrupt individual lines.

And if you want to make absolutely sure that the log line you've just written
is actually safely on disk, then you still have to fsync().
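
To make that concrete, the traditional approach is something like this
(untested sketch; the path is made up):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int log_line_durably(const char *line)
    {
        int fd = open("/var/log/myapp.log", O_WRONLY | O_APPEND | O_CREAT, 0640);
        if (fd < 0)
            return -1;

        /* One write() per full line: O_APPEND makes the seek+write atomic
         * with respect to other appenders on the same file. */
        ssize_t n = write(fd, line, strlen(line));

        /* This is the blocking part: fsync() does not return until the
         * data (and metadata) have reached stable storage. */
        int rc = (n == (ssize_t)strlen(line)) ? fsync(fd) : -1;

        close(fd);
        return rc;
    }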

I would argue that journald's mmap-with-periodic-fsync() approach is actually
better for latency.

If you want to make sure that your system performance isn't affected by
stalled logging i/o, then you should ship the logs off-machine using a non-
blocking protocol. Of course then you lose guarantees that your log-entries
have actually been stored (which might or might not be a problem for you).

~~~
otterley
> Even if you're appending lines to a text-file, if you do it from multiple
> concurrent threads, you're going to potentially corrupt individual lines.

Not true; the write(2) syscall is atomic. You just have to put the entire
line in the buffer you pass to it.

~~~
lomnakkus
Not true AFAICT. (On Linux at least)

Here are a couple of snippets from the man page:

> If a write() requests that more bytes be written than there is room for (for
> example, the file size limit of the process or the physical end of a
> medium), only as many bytes as there is room for shall be written.

(Admittedly, disk full is pretty much an edge case.)

> If write() is interrupted by a signal after it successfully writes some
> data, it shall return the number of bytes written.

~~~
Someone
I think a slightly better way to describe this is

 _" write() is atomic, but may not write the number of bytes asked to write."_

A possible workaround is to use a start of line marker and provide a way to
detect broken lines.

For example, if one starts every line with a 0x02 and ends it in a 0x03 (Start
of Text/End of Text ASCII characters) and removes any such character from the
text being logged, partly written log entries can be detected and ignored when
reading the log. If one does that, one can just retry writing the entire line
until write() indicates it wrote the entire buffer passed in.
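
A rough sketch of that in C (untested; the fixed buffer size is arbitrary):

    #include <errno.h>
    #include <string.h>
    #include <unistd.h>

    /* Frame a record in STX (0x02) / ETX (0x03), stripping those bytes
     * from the payload, so a reader can detect and skip torn records. */
    int write_framed(int fd, const char *msg, size_t len)
    {
        char frame[4096];
        size_t flen = 0;

        if (len + 2 > sizeof(frame))
            return -1;

        frame[flen++] = 0x02;                  /* STX: start of record */
        for (size_t i = 0; i < len; i++)
            if (msg[i] != 0x02 && msg[i] != 0x03)
                frame[flen++] = msg[i];
        frame[flen++] = 0x03;                  /* ETX: end of record */

        /* If write() comes up short, the torn half-frame stays behind
         * (a reader skips it) and we rewrite the whole frame until a
         * single write() lands it completely. */
        for (;;) {
            ssize_t n = write(fd, frame, flen);
            if (n == (ssize_t)flen)
                return 0;
            if (n < 0 && errno != EINTR)
                return -1;
        }
    }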

~~~
mjevans
I disagree. It is far better to use a string-like approach: a structure that
includes the length of the next written object.

I believe that sacrificing a little more storage would also make records
easier to split out. Pre-pad the message with zeros up to the pad size, then a
pseudo-random number of additional pad objects (1 to n). Next, store either
the offset of the start or of the end, as well as the length of the payload
(the start would probably be easier to verify). With that structure it should
be difficult to craft a payload message that looks like a valid message for
the log.
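
Reduced to its core (just the length prefix, leaving out the padding and
offset parts), an untested sketch:

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    int write_record(int fd, const char *msg, uint32_t len)
    {
        char buf[4096];
        if (len + sizeof(len) > sizeof(buf))
            return -1;

        /* The header carries the payload length, so a reader knows
         * exactly how many bytes belong to this record and where the
         * next one starts. */
        memcpy(buf, &len, sizeof(len));
        memcpy(buf + sizeof(len), msg, len);

        ssize_t want = (ssize_t)(sizeof(len) + len);
        return (write(fd, buf, (size_t)want) == want) ? 0 : -1;
    }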

~~~
Someone
I don't quite understand what approach you are advocating, but for me _"it
should be difficult to craft a payload message that looks like a valid message
for the log"_ is not good enough for a system service.

~~~
mjevans
This is a back-of-the-napkin grade design; refine it and/or develop a proof
for your specific use case.

Can you see any way it would be obviously incorrect? I could not when I wrote
it, but I did not take the much greater effort of _proving_ it was a correct
design (which, much like crypto algorithms, should also be done on an
implementation-by-implementation basis).

------
rkangel
This is one of those incredibly clearly written technical articles by someone
who really knows their stuff that brings together a few things I understood
vaguely into a coherent understanding. Lines like:

> There’s a per-kernel sync(), per-filesystem syncfs(), per-file fsync() and
> fdatasync(), per-mmap msync(), and per-file-range sync_file_range().
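
For reference, here's the whole zoo in one (untested) C sketch; the fd,
mapping, and byte range are placeholders:

    #define _GNU_SOURCE   /* syncfs() and sync_file_range() are Linux-only */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    void sync_zoo(int fd, void *map, size_t maplen)
    {
        sync();                          /* whole kernel: every dirty page */
        syncfs(fd);                      /* just the filesystem behind fd  */
        fsync(fd);                       /* one file, data + metadata      */
        fdatasync(fd);                   /* one file, data only            */
        msync(map, maplen, MS_SYNC);     /* one mmap()ed region            */
        sync_file_range(fd, 0, 4096,     /* one byte range of one file     */
                        SYNC_FILE_RANGE_WRITE);
    }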

Incidentally, CoreOS has one of the best home pages that I've ever
encountered. "These are our main products, and for each of them here's an
information icon that you can click that will actually inform you what they do
and what they're for".

I would like to hire whoever wrote those descriptions too: clear, concise
explanations of what things do and why you'd want them.

------
mtanski
This is a fantastic analysis and writeup of the problem.

Is there a practical reason that systemd-journald doesn't do O_DIRECT and
non-blocking writes/fsync (using the io_submit family of syscalls)? If
minimizing blocking is such an important issue, then given the state of disk
IO in Linux, that's your only solution.

As long as you preallocate the space and are not required to do metadata
updates, you will get mostly non-blocking operations. Since this is done for
logging, which I presume is not going to be read right back, there's no
downside to using O_DIRECT; in fact, not polluting the page cache might be a
net benefit.
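
Something like this (an untested sketch using libaio; the file name and
sizes are made up, and you'd compile with -laio):

    #define _GNU_SOURCE          /* O_DIRECT, fallocate() */
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/var/log/journal.bin",
                      O_WRONLY | O_CREAT | O_DIRECT, 0640);
        if (fd < 0)
            return 1;

        /* Preallocate so the async write won't need metadata updates. */
        if (fallocate(fd, 0, 0, 1 << 20) < 0)
            return 1;

        /* O_DIRECT requires an aligned buffer (512B or 4KB depending on
         * the device) and bypasses the page cache entirely. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0)
            return 1;
        memset(buf, 'x', 4096);

        io_context_t ctx = 0;
        if (io_setup(1, &ctx) < 0)
            return 1;

        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        io_prep_pwrite(&cb, fd, buf, 4096, 0);

        /* io_submit() queues the write and returns immediately... */
        if (io_submit(ctx, 1, cbs) != 1)
            return 1;

        /* ...and completion is reaped later, so the logging path never
         * blocks on the disk itself. */
        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);

        io_destroy(ctx);
        close(fd);
        return 0;
    }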

Btw, is there any way to get in touch with the author to talk about the issue?
I can't find his contact details.

------
JdeBP
Duplicates
https://news.ycombinator.com/item?id=11268059

~~~
simoncion
Jesus. From the first bug mentioned in your comment in that thread:

Reporter: "libsystemd-journal or systemd-journald loses some systemd's parts
of message context when few messages logged in a moment"

Poettering: "This is a race. ... To fix this we need [two new things.] ...
These items are listed on the plumbers wishlist, because we need [those things
mentioned previously] from the kernel folks."

~~~
andor
_Software Development Jesus descends from the clouds, wearing a virtual
reality headset and a t-shirt with the words "freedesktop.org". Something
about him reminds you of Bill Gates, but you are not sure what; the headset,
with its peculiar red laser pointer attached to the side, covers too much of
his face. He opens his right palm towards your computer, which surprisingly
seems to be running Fedora Linux now, and through the magic of Zeroconf
networking and D-Bus, the following text appears on your screen:_

Happy to explain this to you, simoncion!

After receiving a log message from a client process, the journal tries to get
_trusted_ metadata, like the application name, about that process. If the
application exits too soon after logging its message, some of that metadata,
e.g. /proc/<pid>/cmdline, might not be available anymore. The message itself
is always logged, though.

Syslog, of course, also logs application names, but that information cannot be
trusted, because it is reported by the clients themselves. So even if not all
expected metadata is logged, you lose nothing compared to syslog.
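
The race is easy to see in code; a rough sketch of the metadata lookup (the
pid would arrive via SCM_CREDENTIALS on the /dev/log socket):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Returns bytes read, 0 if the sender already exited, -1 on error. */
    ssize_t read_cmdline(pid_t pid, char *buf, size_t buflen)
    {
        char path[64];
        snprintf(path, sizeof(path), "/proc/%d/cmdline", (int)pid);

        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return errno == ENOENT ? 0 : -1;  /* ENOENT: process is gone */

        ssize_t n = read(fd, buf, buflen);
        close(fd);
        return n;   /* argv[0], argv[1], ... separated by NUL bytes */
    }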

------
emeraldd
I have to say that the style of that code bugs me ... Not using { } on your
control structures is just asking for nasty non-intuitive bugs ...

