
How To Write Good Log Messages - trotter_cashion
http://www.trottercashion.com/2012/08/12/how-to-write-good-log-messages.html
======
shazow

      "08-08-2012-20:27:59"
    

Which 08 is the day or the month?

    
    
      cut -d ' ' -f2- foo.log | sort
    

Great, now it's sorted by (month, day, year) or (day, month, year). If my logs
span multiple years, the sorting is useless.

Please use YYYY-MM-DD. It's sortable and more intuitive for everyone outside
of the USA. See also <http://en.wikipedia.org/wiki/ISO_8601>

~~~
zdw
To go one step further...

RFC3339 gives a more tightly defined subset of ISO8601, which is even more
useful and lacking things like "week of year" which have little use on modern
systems:

<http://tools.ietf.org/html/rfc3339>

------
gwillen
This may be a bit out of scope for this post, but at Google I used to get a
lot of mileage out of a system I worked on called Dapper:

[http://static.googleusercontent.com/external_content/untrust...](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36356.pdf)

It uses thread-local storage and RPC library instrumentation to thread a fixed
request-id through every RPC call and thread handling a particular user
request.

Then you can do things like automatically adding the request ID to log
messages, transparently to the request-handling code. And when you visualize
logged RPCs, you can include (some subset of) the logged messages in your
visualization.

~~~
emmelaich
Dapper looks fantastic. It reminded me somewhat of the ancient ARM (1996)
[http://en.wikipedia.org/wiki/Application_Response_Measuremen...](http://en.wikipedia.org/wiki/Application_Response_Measurement)
which I thought had died.

I was pleasantly surprised to see that there's a module for Apache in the
works.

~~~
gwillen
Wow, that's cool. The two systems are very similar, and I wouldn't be shocked
if ARM was the inspiration for Dapper.

------
3amOpsGuy
One of the hardest things in ops is trying to see what the app is doing
without benefit of the code to review.

I know what the app does, I know the process flows involved but 'failed to
covert X at MyClass:44' doesn't have enough context for me.

Also log levels - most developers over use warning and error levels and under
uses debug and trace in my experience. It makes filtering out the noise
harder.

Rolling log files - roll on an event (new day, process bounced etc.) not on
reaching a 10Mb limit or an equally arbitrary 1k lines limit.

Use a header in the logfile, contain at least the version number of the code
running, the user and host it is running on.

Lastly stack traces, if they even have a use case at all in production, should
not be in the main logfile.

~~~
moe
_roll on an event (new day, process bounced etc.) not on reaching a 10Mb
limit_

Only applicable when you produce a small, predictable volume of logging. Not a
good general guideline because often you don't know how big your system may
grow in the future.

When logging high volumes then fixed chunk sizes are usually preferable (e.g.
so you know how many you can keep around without a disk running full). Large
sites produce multiple gigabytes of logs per hour.

 _Use a header in the logfile_

Disagree. Context should be provided directly in each individual line. Logs
get sliced and diced in analysis, fetching an arbitrary header from the
_right_ file for a given context can be rather painful.

Don't be afraid of long lines. Use field delimited formats (TSV) so they can
be easily parsed (either via 'cut' or a small script).

 _stack traces ... should not be in the main logfile_

Wrong. When a serious error occurs then as much context as feasible must be
provided. Stack traces are usually very feasible and indispensable.

~~~
3amOpsGuy
>> Only applicable when you produce a small, predictable volume of logging.

No, It's primarily in the larger log volume systems we run into this problem.
Take a FIX session adapter (very common software in the financial services
world), baseline is around 5Gb/hour but its susceptible to fairly extreme
fluctuation.

When you roll the logs at say 500Mb, there's the chance our monitoring system
misses an error in the tail of the previous log as the logs roll (which could
in the worst cases mean loosing a couple of million or even more).

Multiple logfiles for a day are harder to search through when handling the
most common use case for log files, which is not debugging but ops staff
answering questions on 'what happened with X?'

What benefits does a split logfile provide? The one use case is transferring
logs across the network. However most log management systems already split
large logs up before transferring to the archive silo.

You're loosing a lot, but gaining nothing with split logs.

>> When logging high volumes

It makes no difference, when the market swings you're going to need an extra
200Gb anyway, whether the logs are in chunks or not. In large volume apps
deleting to make space is often not an option - you're logging in high volumes
for a reason (often regulatory).

>> fetching an arbitrary header from the right file for a given context can be
rather painful.

That's been a solved problem for a very long time. Just as recombining a
separate trace log with a main app log based on time stamp is a solved
problem. I believe this highlights the bigger problem, you can go to
university or a.n. other school and learn to be a developer. Where can you go
to learn operations? The net result is people come up with the own solutions
each time instead of building on work already done. That work is available but
its often passed on by word of mouth. Surprisingly hard even in today's world
to find decent blogs on ops work (thankfully there are more around now).

>> When a serious error occurs then as much context as feasible must be
provided

Yes this is the extraordinarily rare case where stack traces can be useful.
They still don't belong in the main app log though.

Push to a separate trace log and recombine on the fly when you need to view.
The benefits are to the ops team and the monitoring system.

Another approach not often mentioned is black box logging. Log to a circular
buffer in memory, the first bytes of the buffer are set to a pattern that can
be searched for in a core dump, the remainder is used as a circular buffer.
It's fast to log to, takes no disk space and provides crucial context in the
event of a crash.

Im not sure I'll be able to convince you based on your tone, but speaking from
experience, you are making the most common mistakes.

~~~
moe
_When you roll the logs at say 500Mb, there's the chance our monitoring system
misses an error_

Sorry but then your monitoring system is broken? Why is its operation tied to
log chunking?

 _Multiple logfiles for a day are harder to search_

Not really. Often enough interesting events span multiple days (or cross
midnight) anyways. Thus regardless of chunking you have to be prepared for
that case.

 _What benefits does a split logfile provide?_

I'll give you that "one file per day" seems easier at first, but I've been
bitten by that too often. A script to identify the file-range for a given
time-range via binary search is really easy to write, has only to be written
once, and the benefits of fixed chunks are:

Easier handling. If you need to offload "50GB" from a given log-host then
that's easier to do when you know that will be 50 files.

You don't risk a scratch disk running full (i.e. those small and expensive
spindles that take the initial load) when the application decides to have a
bad-hair day and put out 10x the normal volume for a while (e.g. interleaved
with stack traces).

 _> >fetching an arbitrary header from the right file for a given context can
be rather painful._ _That's been a solved problem for a very long time. Just
as recombining a separate trace log with a main app log based on time stamp is
a solved problem._

I said painful, not impossible. You say you're dealing with 5GB/hour logs,
then you must be well aware how long even the simplest merge-job takes on the
slice of a single day?

 _> > When a serious error occurs then as much context as feasible must be
provided Yes this is the extraordinarily rare case where stack traces can be
useful._

We must be living in different worlds then.

When errors pop up in the main logfile, which happens frequently, then the
last thing I want is to stitch together context from _other_ files.

A mere grep on a day's worth of logs takes about 10 minutes for us. Any kind
of text-processing or "for each line perform lookup in $otherplace" quickly
pushes that into the hours.

~~~
3amOpsGuy
>> Sorry but then your monitoring system is broken? Why is its operation tied
to log chunking?

Yours clearly bends some unbendable truths of the known universe ;-)

A monitoring system ordinarily samples the logfile at a given point in time,
that is an open followed by a close operation on the logfile (lest you run
into phantom disk usage issues caused by holding an open file handle to a
compressed / rotated logfile).

If between sample 1 and sample 2 the log file is rolled, your monitoring
system never sees the data between sample 1 and the end of the previous log.

Possible workarounds include logfile streaming (say via a pipe on the
filesystem) but that introduces much larger problems and would in no way be
compatible with the app rolling its log files.

A much simpler fix is to roll only on events (bounced the app etc.)

>> A script to identify the file-range for a given time-range via binary
search is really easy to write

Contrast with no script required. No bugs, no maintenance time, no different
versions in different regions. Just no script. The benefits don't stop there -
when you need to query, there's no waiting for a bunch of disk seeks while the
script runs against a busy disk controller - you can just get straight on with
parsing the results.

>> Easier handling. If you need to offload "50GB" from a given log-host then
that's easier to do when you know that will be 50 files.

Moving 50 items is easier than moving 1? I'm unconvinced but still open to any
interesting ideas here.

>> You don't risk a scratch disk running full

There is no change to the risk profile, neither approach is better in this
regard, unfortunately.

>> e.g. interleaved with stack traces

Just one of the many excellent reasons not to log them in production.

>> you must be well aware how long even the simplest merge-job takes on the
slice of a single day?

Yes - it runs at full disk I/O speed with almost no CPU overhead. Remember we
are merging sorted data (already in temporal order).

>> which happens frequently

It's clearly a high profile app you're working on (in my experience the people
in charge of the pennies refuse to sign off on the cost of disk space for this
amount of logging without solid reasoning) i'm surprised it behaves as badly
as you say. Concerning, but then it also appears you don't have an ops team
either.

>> stitch together context from other files.

I get the impression you're thinking that you should manually stitch these
together? On the very rare occassions you need this data, remember both are
already in time order, we just instantly merge on the fly.

>> A mere grep on a day's worth of logs takes about 10 minutes for us.

It sounds like you've got it all in one file, i.e. not splitting your
audit.log out from your error.log from your trace.log? - techniques which I
couldn't do without to be honest (the monitoring system would have to consume
every byte logged otherwise, when it really only needs to see a subset of
total log output - thus saving CPU time and disk seeks).

>> Any kind of text-processing or "for each line perform lookup in
$otherplace" quickly pushes that into the hours.

 _shakes head_

Would you humour me with some context here. What is your background? From your
replies i would say you are non-ops staff[1], self-taught[2] and with limited
experience[3].

[1] You don't demonstrate any knowledge of basic tools, e.g. you mentioned
writing a script for a problem solved by a 35+ year old command found on any
unix host

[2] You haven't volunteered anything more advanced than obvious / first order
approaches. Approaches that i'd expect any CompSci university grad could
suggest off the cuff with no prior experience.

[3] You mention grepping through a single logfile of at least 35Gb, likely
much larger.

~~~
moe
_Possible workarounds include logfile streaming (say via a pipe on the
filesystem) but that introduces much larger problems and would in no way be
compatible with the app rolling its log files._

What you call a workaround (streaming to a central location via syslog or
scribe) happens to be the standard approach in my corner of the world.
Analytics and monitoring operate naturally on the stream because, as you point
out, sampling on rolling logfiles is not exactly reasonable, neither is
scattering logfiles across application servers.

I'm leaving this discussion at this point because I'm not interested in your
condescending tone and insults. It appears you haven't even centralized your
logging, yet feel entitled to give ops-advice that is at odds with how the
rest of the world operates.

~~~
3amOpsGuy
>> streaming to a central location via syslog or scribe

Ahh, "man mkpipe", you're mixing up a separate concept - streaming to a remote
host vs streaming via a pipe on the filesystem.

>> standard approach in my corner of the world

Syslog streaming has been a common approach since the ~70s i'd guess? Likely
before.

>> yet feel entitled to give ops-advice that is at odds with how the rest of
the world operates

Hopefully, you see the irony in this.

~~~
moe
Cut it out already...

 _you're mixing up a separate concept - streaming to a remote host vs
streaming via a pipe on the filesystem_

You said above, quote: _but that introduces much larger problems and would in
no way be compatible with the app rolling its log files._

Why would you have log-pipes, or log-files on your app-servers to begin with?

 _Syslog streaming has been a common approach since the ~70s i'd guess?_

Then how come you're not doing it?

 _Hopefully, you see the irony in this._

All I'm seeing is a constant stream of arrogance that doesn't seem to be
backed up. Also, the command that you wanted to helpfully point out is called
"mkfifo" or "mknod".

~~~
3amOpsGuy
>> Hopefully, you see the irony in this.

>> All I'm seeing is a constant stream of arrogance

Perhaps not...

------
Pfhreak
UTC is great for timestamps, but he needs to make his times sortable (YYYY-MM-
DD) Tab delimited fields are a problem, unless you escape the tabs inside the
data you log.

This format seems VERY similar to the Graylog Extended Log Format (or GELF)
<https://github.com/Graylog2/graylog2-docs/wiki/GELF>

I'm a huge fan of JSON logs now, having just finished implementing them on a
major work project. Every language out there can take JSON in and do work
against it, and there are some really awesome tools for rendering filtered
logs to a browser in realtime.

~~~
mseebach
> Tab delimited fields are a problem, unless you escape the tabs.

I don't understand this? A tab is a tab?

> JSON

Downside: File size. If you have any appreciable level of traffic/activity
this matters.

~~~
eru
They should compress well.

~~~
Pfhreak
In fact, they compress to about a 3% of their size. Since they really only
exist in text file form for a limited period of time before being moved on to
a separate logging system, their exploded size isn't much of a concern.

------
VikingCoder
time is expressed in seconds? My computer does about 3 billion things every
second. I think milliseconds or preferably microseconds would be much better.
Different domains have different needs, though, I'm sure. Also, I'd prefer the
log specify the time zone it thinks it's in, rather than just convert to UTC
automatically.

pid - yes, knowing the process id is good, but knowing the thread is also
useful.

version - outputting this on every single log line is probably overkill. Ditto
for host name.

~~~
trotter_cashion
Thanks for the feedback. I'm with you on the milliseconds and threads bits. As
for hostname / version... how are you getting those into your logs? I've found
having version in the logs very useful (and wished I had it when it wasn't
there).

~~~
pydave
Couldn't you output the host and version at the beginning of the log? Any tool
that transforms the logs (splicing together logs from different servers)
should include these bits in it's output.

Also, why don't you use a sortable form for time? (With the most significant
parts first.)

~~~
trotter_cashion
My issue with host and version at the beginning of the log is that they'll
roll away when your logs rotate. If you've got a service that's rarely taken
down, anything written to the front of the logs is likely long gone by the
time you have a problem. That said, there may be a better way to handle these
two. I certainly don't like repeating information if it's not necessary.

~~~
nitrogen
You could just print the version, etc. again when logrotate sends your daemon
a SIGHUP.

~~~
trotter_cashion
Ahh... never thought of doing that. I like it.

------
jamesaguilar
Most important point: any log message that doesn't identify some property of
the request that originated it (so that you can identify requests leading to
erroneous conditions) is cause for immediate defenestration.

~~~
sarabob
Agreed. We use mod_unique to add a header to each response and make sure it's
included in each error log, back end access log and front end access log as
well as being viewable via Charles/fiddler at the client. Being able to tie
the various logs from different machines together has been invaluable.

------
philjr
This link from a talk by Jordan Sissel has some great information on logging
and information that's commonly missed in log files (even in some very large
products)

[http://blip.tv/carolinacon/logstash-open-source-log-and-
even...](http://blip.tv/carolinacon/logstash-open-source-log-and-event-
management-jordan-sissel-5123601)

------
memset
I personally have found a lot of difficulty with JSON-based logs, because
tools like `grep` and `cut` don't suffice; I usually have to write a more
complex regex with `sed` in order to extract useful information I want.

Anyone have a good utility for managing JSON in log files?

~~~
mgurlitz
Moving away from using regex to parse JSON, you can use Jsawk and Underscore-
cli to manipulate JSON like Awk or JavaScript.

<https://github.com/micha/jsawk/> <https://github.com/ddopson/underscore-cli>

~~~
robmil
underscore-cli is utterly brilliant for this, even better — dare I say it? —
than traditional log files with cut/grep/awk. Being able to treat logs as a
dataset that can be mapped/reduced/flattened/grouped/otherwise analysed is
life-changingly good.

About the strongest endorsement I can give is that it makes me wish _more_ of
my logs were in JSON format, rather than wishing that less were as I always
had before.

------
shell0x
I reviewed my some of my code and figured out that I'm one of this terrible
persons, which haven't spent enough time to show an useful output. I should
fix this now ;) A useful log output can be a real time saver. I also seen
applications which just showed "an error occurred", so I had still no idea
what causes this error message.

------
Jimbotron
In my experience the measure of good logging is the ability for someone,
either yourself or an operations engineer to be able to easily solve problems
using only the logs. They shouldn't have to refer to the source code.
Developers should be using this as the yardstick when asking questions about
their log output.

------
roryokane
So how do you implement this? In your Rails apps, what kind of library do you
use to write log messages in this format? Is it an open-source library or one
you wrote just for your company?

~~~
arockwell
Look at the ruby libraries for syslog. Iirc, there is a ruby standard library
for syslog.

------
slyphon
finally! someone needed to write this, if only to have something to point
junior devs to. :)

