
Logging can be tricky - hariharan_uno
http://corner.squareup.com/2014/09/logging-can-be-tricky.html
======
mml
Reminds me of the time I sped up the main business app of a large company by
85% by removing "debug" logging. 2TB/hr of "made it here" isn't really useful
at the end of the day.

Not the first time I've seen that by a long shot.

//shakes zimmerframe, shuffles off

~~~
derefr
Thus why your logging library should have a configurable log-level, and the
first thing the user-facing log() function should do is check the arguments
against the log-level and early-return if the message isn't important enough.

(In languages that support macros, you could also just make debug_log() a
macro that gets compiled out on release builds, like assert(). That means a
whole bunch of object files are going to be functionally different between
debug and release builds, though, which is annoying for a lot of operational
reasons.)

My favourite solution by far, though, is the one recently implemented by
Elixir: log() is a regular function, but takes a closure instead of a string.
That way, you can actually put all the work required to _generate_ the log
message (inspecting data structures, doing intermediate lookups, etc.) inside
the closure, and it _all_ gets elided when you decrease the log-level.
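
In Java, the same idea can be sketched with a Supplier-based API. This is a
hypothetical minimal `LazyLogger` class, not any real framework's API (though
Log4j 2 ships comparable lambda overloads): the closure runs only when the
message clears the configured level, so all the expensive message construction
is elided otherwise.

```java
import java.util.function.Supplier;

// Hypothetical minimal logger illustrating closure-based lazy logging.
class LazyLogger {
    enum Level { DEBUG, INFO, WARN }

    private final Level threshold;

    LazyLogger(Level threshold) {
        this.threshold = threshold;
    }

    // The Supplier is invoked only when the message will actually be emitted,
    // so expensive message construction is skipped at higher log levels.
    void debug(Supplier<String> message) {
        if (Level.DEBUG.compareTo(threshold) >= 0) {
            System.out.println("[DEBUG] " + message.get());
        }
    }
}
```

With `log.debug(() -> render(hugeObject))`, `render()` is never called when the
threshold is WARN.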

~~~
mml
At least in Java, log levels don't buy you much when you wind up rendering
object graphs to strings (via your ohsohelpful custom, recursive toString
implementation), then duplicating that to concat "made it here" onto it, then
immediately abandoning the mess on the heap for the GC to handle, since the
production log level is warn, so your carefully built, highly detailed log
message is never used. Repeat every other line or so... In tight loops, call
iter.next in the log message, and on and on...

~~~
yummyfajitas
That's not a Java problem, that's an "I don't understand Java logging"
problem. The right way:

    log.debug("I created this gigantic object {}",
              myBigUglyObject)

------
cryowaffle
Grey text on white background? Please help!

~~~
samstave
You just don't understand sleek, minimal design. How pedestrian.

/s

------
ibar
I had a similar problem with Java -- except that the entire application would
freeze for double-digit seconds. Another application would sometimes write a
huge amount of data out very quickly to the fs cache. 30 seconds later (or w/e
the expiration is), all those dirty bytes would get sync'd to disk more or
less at once.

Turns out it was the JVM provided GC logging hanging on flush (not even fsync)
calls. The flush call was during GC, and while the GC implementation held a
stop the world lock. Digging through JVM source code is 'fun'.

~~~
lkrubner
I think I have exactly this problem. How do I confirm this? "strace" does not
show me any obvious problems. Where would I look to see if the JVM "flush" was
giving me problems?

~~~
ibar
Well, the easiest way to confirm is to turn off GC logging and see if your
huge "GC pauses" go away. Alternatively, you could turn up the verbosity a bit
- there's an extra flag for details about reference processing. In my case I
was able to track down the exact line in the JVM from the log lines
attributing the huge pauses to whatever trivial component occurred right after
the first flush call.

If you record system metrics (eg. to ganglia) then you can also attempt to
correlate large pauses to a large and rapidly declining number of dirty bytes
in the fs cache.

------
jcampbell1
If all three machines had the same logging code, and one machine was fsync'ing
slowly, isn't turning off fsync just a bandaid that hides the true problem?

When they discover the actual problem is some issue with the raid controller,
I promise not to say "I told you so".

~~~
evan_miller
Hi, post author here. I didn't make it as clear as I could have, but the
difference is that the problematic system had an unrelated process creating
slightly more writes. I sort-of glossed over this with the 5%-20% difference
in i/o util.

Unrelated write activity on a filesystem can cause fsync() calls in any
other process to vary wildly in latency. This can be replicated; here's an
experiment for you. First, run this:

    strace -T -efsync ruby -e'loop { STDOUT.fsync; puts "a" * 120; sleep 0.1 }' > ~/somefile

Then, in another terminal do a little bit of writing -- make sure it is on the
same filesystem. For example:

    dd if=/dev/zero of=~/someotherfile bs=4M count=1

On my poor little aws VM, here is what I see:

    fsync(1) = 0 <0.025072>
    fsync(1) = 0 <3.930661>
    fsync(1) = 0 <0.024810>

That is, writing 4 megabytes in an unrelated process caused fsync() to jump
two orders of magnitude.

Removing fsync() is an appropriate fix because we don't really ever want to
flush this data to durable storage.

~~~
jcampbell1
I probably read "an otherwise uniform cluster" too literally.

20% utilization on a resource that can only do 100 random operations per
second is a major problem, and way different from 5%.

A piece of hardware that operates at 100 Hz, with 20% utilization, will block
for about 135ms+ for the 95th percentile request. With 5% utilization, it will
block for about 10ms for the 95th percentile request.

My quick calculations are somewhat below the 200ms discrepancy you show in the
chart, but not far off.

Of course, turning off fsync is a perfectly good solution. Longer term, I
would move to SSDs and just make this entire class of problem go away. I don't
even have any spinning platters to test your strace on.

------
tendeer
While I really like the article and its analysis, I think there is a more
general point here - scale is tricky.

Most programming abstractions and tools (such as Logging / UI design /
persistence) across instances are all "easy" when it comes to one machine (or
a few machines) in a closed, deterministic environment.

All of these become extremely tricky when you have to deal with many instances
across many different, possibly nondeterministic environments. For UI, the
questions quickly become: what's the viewport? What's the real estate? How does
interaction work (touch? mouse? keyboard? numpad?)? For persistence, one has to
worry about the number of simultaneous connections, character encoding and
normalization, reads/writes per second, synchronisation, etc.

So, also for logging: are we logging client-side or over the network? What's
the memory on the client? What else is running on the client? What are the
permissions through which we log? Do we care about a system call on this
client given any user interaction? etc.

Logging, UI, persistence, network protocols, heck even just choosing an
intermediate data structure are all tricky... at scale, across devices and
nondeterministic environments.

------
dclusin
Compressing the text logs prior to writing them to the disk also helps with
these kinds of issues. You can also offload your logging to a dedicated thread
and then use a lock free queue to increase your performance even more.

~~~
derefr
> You can also offload your logging to a dedicated thread and then use a lock
> free queue to increase your performance even more.

Or use syslog() like a sane person, and let that "extra thread" live inside
the OS IPC mechanism.

(Or stdout/stderr like a _modern_ sane person, and let
upstart/systemd/docker/etc. push your logs to syslog if that's where it feels
like pushing them.)

~~~
heavenlyhash
The OS IPC mechanism quite likely does not use a lock free queue -- or at
least, not in quite the same way as I think the grandparent post refers to.

Using a well implemented ring buffer [+] can get enqueue operations down to a
few instructions and something like two memory fences.
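
For illustration, here is a minimal single-producer/single-consumer ring
buffer in Java. The class name `SpscRingBuffer` is hypothetical, and a real
implementation like the Disruptor adds cache-line padding, batching, and wait
strategies; but the core enqueue really is roughly a bounds check, an array
store, and one ordered write publishing the new tail:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal single-producer/single-consumer ring buffer sketch.
class SpscRingBuffer {
    private final String[] slots;
    private final int mask;                            // capacity must be a power of two
    private final AtomicLong tail = new AtomicLong(0); // advanced by the producer
    private final AtomicLong head = new AtomicLong(0); // advanced by the consumer

    SpscRingBuffer(int capacityPowerOfTwo) {
        slots = new String[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Producer side: returns false instead of blocking when the buffer is full.
    boolean offer(String msg) {
        long t = tail.get();
        if (t - head.get() == slots.length) return false;  // full
        slots[(int) (t & mask)] = msg;
        tail.lazySet(t + 1);  // ordered store publishes the slot to the consumer
        return true;
    }

    // Consumer side: returns null when the buffer is empty.
    String poll() {
        long h = head.get();
        if (h == tail.get()) return null;                  // empty
        String msg = slots[(int) (h & mask)];
        head.lazySet(h + 1);
        return msg;
    }
}
```

No syscall, no scheduler wakeup, no privilege-mode switch anywhere on the hot
path -- which is the whole contrast with IPC being drawn above.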

The overhead of IPC, which wakes up the kernel scheduler, switches the
processor back and forth between privilege modes a few times on the way,
knocks apart all the CPU cache and register state to swap in another process,
while the MMU is flipping all your pages around because these two processes
don't trust each other to write directly into their respective memory... is
not going to have quite the same performance characteristics.

A notable moment in the history of logging is Java's log4j framework, which,
within a single process, used exclusive synchronization. When this was replaced by a
(relatively) lock-free queue implementation, throughput increased by orders of
magnitude. (Their notes and graphs on this can be found at
[https://logging.apache.org/log4j/2.x/manual/async.html](https://logging.apache.org/log4j/2.x/manual/async.html)
.) This isn't an exact metaphor for the difference between a good lockfree
ringbuffer and IPC either, but it certainly has some similarities, and indeed
ends with a specific shout-out to the power of avoiding "locks requiring
kernel arbitration".

--

[+] The "mechanical sympathy" / Disruptor folks have some great and accessible
writeups on how they addressed the finer points of high-performance shared
memory message passing. [http://mechanitis.blogspot.com/2011/06/dissecting-disruptor-whats-so-special.html](http://mechanitis.blogspot.com/2011/06/dissecting-disruptor-whats-so-special.html)
is one of my favorite reads.

~~~
derefr
I think you're thinking of a synchronous RPC mechanism. I was talking about
IPC mechanisms like unix domain sockets, where sending the message doesn't
interact at all with the receiving process, but literally just sticks it into
a buffer, where the other process can come and get it later.

------
erikb
Evan, I would be happy if you could explain more about what you see in the
strace outputs. E.g.:

> Time spent in fsync accounts for almost exactly the delay we were seeing.

What delay? I see the whole thing taking 1.5 seconds, with 1.3 seconds spent in
futex (0.4 more than on the normal host). Not sure why we are suddenly
talking about fsync. I also don't know what either call (futex, fsync) could
be doing.

All these are not questions I want answers to (some stuff I could google of
course). I just want to show that it's a rather confusing read for some
readers if you expect them to understand the strace outputs as well as you do,
when you seem to be using that tool on a daily basis and the readers might not
have used it at all, ever. It would be great to follow your insights better.
Just small additions like the following would help a lot: "[The X seconds]
spent in fsync [seen in diagram A] accounts for almost exactly the delay we
were seeing [in diagram B]".

~~~
zorbo
> Just small additions like the following would help a lot: "[The X seconds]
> spent in fsync [seen in diagram A] accounts for almost exactly the delay we
> were seeing [in diagram B]".

Exactly. I try to always point to the actual data when I'm telling my readers
my conclusions.

"Calls to futex, epoll_wait and select are largely the same on both systems.
The problem host does 1727 calls to fsync(), while the normal host does only
849 calls. The problem host spends 1.24% of its time in fsync(). The normal
host only 0.48%. This makes fsync() our number 1 suspect."

This at least lets the reader refer back to the output and see corresponding
entries.

------
sargun
Why would you fsync logs for a high-level service? Are you afraid a power
outage is going to cause you to lose service logs?

~~~
TheLoneWolfling
Making sure your logging data gets written is relatively important.

Say, if you hit a bug somewhere that causes the server itself to crash.

~~~
frankzinger
How about installing a SIGSEGV, etc. handler to do the fsync (and perhaps even
print a stacktrace)?

~~~
Rapzid
Disabling buffering in the file library should provide a similar mitigation
outlook. If the OS accepts the writes, they WILL make it to disk unless some
part of the storage subsystem (or the OS itself) fails. So if the process dies,
recent writes should make it to disk regardless. But if the process triggers a
complete system crash then you're kinda stuck needing fsyncs :|
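
To make that concrete, here's a Java sketch of the mitigation. The
`UnbufferedLog` class is hypothetical (a real app would use its logging
framework's immediate-flush setting instead): each write() goes straight to the
OS, so the bytes sit in the kernel page cache rather than a private userspace
buffer and survive a process crash, while the fsync() the article removed is
deliberately absent.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of application-level unbuffered logging.
class UnbufferedLog implements AutoCloseable {
    private final FileOutputStream out;

    UnbufferedLog(String path) throws IOException {
        // Deliberately NOT wrapped in BufferedOutputStream: no userspace buffer,
        // so nothing is lost in the application if the process dies.
        out = new FileOutputStream(path, true);
    }

    void log(String line) throws IOException {
        out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
        // No out.getFD().sync() here -- that's the fsync() being avoided.
        // Only a whole-system crash can now lose recently written lines.
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```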

~~~
frankzinger
> Disabling buffering in the _file_ library should provide for a similar
> mitigation outlook

Did you mean _logging_ library? I.e., application-level buffering? In that
case, good point.

------
govindkabra31
Did anyone understand why logging calls on only the problem machine had
different fsync behavior than on the normal machines?

~~~
evan_miller
Hi, post author here. The slight difference in i/o (specifically: writes) was
the trigger. I talked a little more about that here:
[http://corner.squareup.com/2014/09/logging-can-be-tricky.html#comment-1603001511](http://corner.squareup.com/2014/09/logging-can-be-tricky.html#comment-1603001511)

And here:
[https://news.ycombinator.com/item?id=8359556](https://news.ycombinator.com/item?id=8359556)

Even on the problematic host we only saw this latency issue in the 99th
percentile. That is: even on the problem host 99 out of 100 queries were
served as expected and only 1 out of 100 saw this additional latency.

~~~
unclebucknasty
Well, yeah, I noticed you guys' response to one of the comments on the blog
post indicated that the problem machine had a different workload (additional
tasks or something). That caused the additional writes, which then caused the
latency for the main app on the box.

I think your point still stands about logging, being cautious about blocking
I/O calls, etc. But it seems the bigger point is one of how your overall
system is architected: which processes run where, dedicating like nodes to
their tasks vs. the potential quality/consistency issues that arise from having
some pull double-duty, etc.

That seemed to be the source of the real issue here.

~~~
evan_miller
Sort of. The catch is that even a very small write, say just a few megabytes,
can drastically change the cost of an fsync(). On my test aws VM even writing
just 4 megabytes one time is enough to trigger the problem. Even on an
otherwise fully isolated system a few megs may be written from time to time,
for example by a management agent like chef or puppet. Or by an application
deploy copying out new binaries.

For example, here I reproduce the problem on a completely isolated machine:
[https://news.ycombinator.com/item?id=8359556](https://news.ycombinator.com/item?id=8359556)

~~~
jamesaguilar
IMO the real issue is that a competent logging framework doesn't block app
code to sync the log to disk. The buffer should be swapped out under lock, and
then synced in a separate thread. Yuck.
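
A sketch of that scheme in Java -- the `SwappingLogger` class is hypothetical,
and it assumes a dedicated flusher thread calls drain() periodically. The app
threads only ever do an O(1) append under the lock; the slow disk work happens
on the drained batch, off the application path:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical "swap the buffer under a lock, sync elsewhere" logger.
class SwappingLogger {
    private final Object lock = new Object();
    private List<String> buffer = new ArrayList<>();

    // Fast path called by application threads: just an append under the lock.
    void log(String msg) {
        synchronized (lock) {
            buffer.add(msg);
        }
    }

    // Called periodically by a dedicated flusher thread. The swap itself is
    // O(1); the expensive write/fsync then happens on the returned batch
    // without holding the lock.
    List<String> drain() {
        List<String> full;
        synchronized (lock) {
            full = buffer;
            buffer = new ArrayList<>();
        }
        return full;  // caller writes these to disk and fsyncs if desired
    }
}
```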

~~~
shabble
The downside is of course that if you crash hard, the most valuable log
entries are the ones least likely to be on-disk afterwards.

~~~
snuxoll
Which is why logging to disk on the server is BAD: have your log framework
write to stdout and have upstart/systemd/whatever handle writing to a remote
syslog server or whatever your fancy is.

------
kordless
If you want tricky, try building a logging service and then enable logging to
itself.

~~~
vacri
I once enabled logging on an AWS S3 bucket... which was our 'logs' bucket...
which was the stated destination for the bucket logging. It wasn't a
particularly costly error, but there was a lot of logspam...

------
nkozyra
Does anyone think about doing logging to shared memory / memcached and then
committing snapshots to disk at regular intervals via another process/machine?

If you're not all that concerned about consistency, each web server can keep
their logs in a segregated memory space and then another process can
combine/commit and send a flush command, leaving the primary machines
relatively unencumbered.

~~~
evan_miller
With fsync() taken out of the equation this is essentially how a naive logger
operates. The linux page cache plays the role of shared memory and only under
rather heavy contention will a write() incur latency.

The trouble outlined in my blog post is that the logging framework was calling
fsync() -- that is, specifically asking to flush the page cache all the way to
disk.

~~~
jacques_chester
Have you experimented with a log-structured filesystem? I once saw similar
high-variation behaviour in a logging problem on extfs. Switching to NILFS
pretty much got rid of variable latency entirely.

~~~
evan_miller
Not recently, no. I do know that a journaled filesystem can exacerbate this
sort of problem since it creates extra work. For example:
[http://lwn.net/Articles/328363/](http://lwn.net/Articles/328363/)

In a few cases in the past when dealing with unimportant data I have
downgraded to ext2 for a nice performance bump.

~~~
jacques_chester
Just to make sure we're talking about the same thing, a journalling FS isn't
the same thing as a log-structured FS.

The former has a write-ahead log; the latter is basically just a log. So
immediately writing to disk is relatively simple.

~~~
evan_miller
Yes. I haven't done anything with a log-structured FS. I have only played
around with conventional filesystems.

------
digikata
Does this increase the chance that you lose a bit of critical log data if a
fault causes the system to go down?

~~~
jewel
Yes, but if the logs are just for debugging an application, it usually doesn't
matter. Application logs are useful for figuring out why an application
crashed, but shouldn't be relevant for figuring out why a system went down.
However, if the app crashes while some log information is still in the system
buffers, the system will still write them out, even if fsync() is never
called.

It's been a few years since I looked at this, but it seems like there's only a
small window that data will stay in the buffer before it will get
automatically flushed. The nice thing about this automatic flush is that it
won't cause the application to block.

~~~
viraptor
Also if your logging plugin / framework / module allows logging to remote
syslog, that's a pretty good idea - no disk blocking. Although you are then
risking data loss when the network goes down...

------
canterburry
Wow...this post made me think I shouldn't even attempt to run my own server
infrastructure for my startup. This kind of analysis is way deeper than I'm
currently capable of.

~~~
rjbwork
Hell, we're now infrastructure-less. No VMs (except for our QA to test on,
which doesn't count), no static servers we can't just destroy and spin up
another of in <5 mins, and no database servers that we have to manage.

It truly is a joy.

~~~
oftenwrong
Curious: How do you persist data with no database servers? Using an outside
service to store it?

~~~
rjbwork
We used Azure SQL for our RDBMS requirements, and blob/table storage for our
NoSQL stuff. Sure there's a server behind it, but it's transparent to us and
it's geo-redundant and replicated. When I say "no servers" what I mean is, we
don't need any system admins and only very few ops folks. In fact at the
moment, devs are pretty much ops since we need very little of it.

I also forgot to mention we have a build server, but that's just a virtual
machine that we back up regularly. It's our only piece of static
infrastructure. Everything else, except storage which is, again, transparent
to us, is destroyed on a weekly, if not daily, basis for new releases.

------
mleonhard
I think this problem is mostly due to using Go, whose libraries are not
mature.

