
Logging vs. instrumentation - grey-area
http://peter.bourgon.org/blog/2016/02/07/logging-v-instrumentation.html
======
jacques_chester
I think, like most people, I will disagree with this:

> _In my opinion, my thesis from GopherCon 2014 still holds: services should
> only log actionable information._

This assumes that I have future knowledge of which logged events will be
actionable.

I used to work in the classifieds department of a newspaper. The rule of thumb
was that the cover of the newspaper was worthless by 11am, the classifieds by
2pm.

But that's not quite true. Every single edition of a newspaper retains a small
probability of turning out to be very important at some unknown future time.
Hence the practice of archiving.

So it is with logs. In my day job I have on several occasions turned to raw
logs in order to reconstruct the history of unexpected events.

What metric, to take an actual example, tells me that a user was trying to
evade export control by hopping to a VPN halfway through a session? Which
gauge tells me that an error was caused by the interaction of third-party
JavaScript and an obscure web browser?

For myself, I draw the boundary differently. Logging is for events with
identity, events that will be inspected individually. Metrics are for events
that will never have identity and will only be understood statistically.
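
To make the boundary concrete, here's a minimal sketch in Go (the event and
names are made up for illustration): the log line carries identity that can
be inspected individually later, while the metric is an anonymous count that
is only meaningful in aggregate.

    package main

    import (
        "expvar"
        "log"
    )

    // loginFailures is a metric: an anonymous count, only understood
    // statistically.
    var loginFailures = expvar.NewInt("login_failures")

    func onLoginFailure(userID, sourceIP string) {
        // The log entry has identity: this user, this IP, this moment.
        // It can be inspected individually, months later.
        log.Printf("login failed user=%s ip=%s", userID, sourceIP)

        // The metric never has identity.
        loginFailures.Add(1)
    }

    func main() {
        onLoginFailure("user-42", "203.0.113.7")
    }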

~~~
sagichmal
IMO the author buried the lede:

> Finally, understand that logging is expensive. I’ve seen entire teams of
> absolutely brilliant engineers spend years building, managing, and evolving
> logging infrastructure. It’s a hard problem, made much harder by
> overburdening the pipelines with unnecessary load.

It's ultimately a piece about engineering cost versus benefit. There's a real
cost associated with logging every event with identity, to create an audit
trail. Maybe that's necessary in your domain, for compliance reasons or
otherwise. In which case it falls into the bucket of machine-read structured
logging, and you design for it. But for lots of domains, it isn't strictly
necessary.

~~~
jacques_chester
In which light, allow me to rephrase the argument:

Log as much as you can afford.

Given that the cost of logging is low and continuously falling, the default
decision should be "log everything".

"Log only actionable messages" is analogous to advice like "always use bit
masks instead of booleans", or only "use two digits to encode years". It
assumes algo-economics that are no longer the normal case.

~~~
sagichmal
> Given that the cost of logging is low and continuously falling

Everything gets cheaper over time, but the cost of logging is actually quite
high.

~~~
jacques_chester
I don't know who downvoted you, though I still disagree.

If I were writing an embedded system, or a system with soft-to-hard realtime
requirements, or one where flat, straight-line performance was the only
requirement that counted, sure, I might ditch or constrain logging.

These systems might be common by manufactured volume. But most programmers
will never touch one.

You're probably thinking of I/O costs. Again, these aren't as brutal as they
used to be -- SSDs swallow bursty writes with aplomb, and when those aren't
enough, log-structured filesystems are good at smoothing out writes.

As for myself, in the work I do in my day job, I print out to STDOUT/STDERR
and let the platform wick those lines away to a central firehose.

~~~
sagichmal
> in the work I do in my day job, I print out to STDOUT/STDERR and let the
> platform wick those lines away to a central firehose.

I think this is the correct approach, but to a first approximation every
organization I've worked at that does this has hit the saturation limits of
"the platform" in extremely short order. In my experience, especially if
you're doing audit-trail-style logging, this becomes the primary bottleneck of
your infrastructure very quickly.

~~~
jacques_chester
Did you come up with particular tactics? Any insights on platform bottlenecks?

------
jakozaur
Sounds like it was written by someone who lacks good infrastructure to collect
and search logs.

E.g. it is very useful to collect all user actions and other major events.
Later, when a user runs into a problem and sends support a message, you know
what was going on. You can tell the user their issue was fixed instead of
asking them for reproduction steps.

Aggregate numbers tell you little about what happened to one particular user,
and at scale there are bugs that only some people will hit.

You can process logs and produce metrics from them, but you can't do the
opposite. Saying logs are just human-readable actionable errors seems
backwards to me.
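
As a minimal sketch of that asymmetry (the log format and field names here
are hypothetical), deriving a per-level metric from structured log lines is a
few lines of Go; there is no way to run it in reverse:

    package main

    import (
        "bufio"
        "encoding/json"
        "fmt"
        "os"
    )

    // entry models a hypothetical structured log line such as
    // {"level":"error","msg":"payment failed","user":"42"}.
    type entry struct {
        Level string `json:"level"`
    }

    func main() {
        counts := map[string]int{}
        sc := bufio.NewScanner(os.Stdin)
        for sc.Scan() {
            var e entry
            if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
                continue // skip unparseable lines
            }
            counts[e.Level]++ // a metric, derived after the fact
        }
        for level, n := range counts {
            fmt.Printf("%s: %d\n", level, n)
        }
    }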

------
vbezhenar
> In my opinion, my thesis from GopherCon 2014 still holds: services should
> only log actionable information. That includes serious, panic-level errors
> that need to be consumed by humans, or structured data that needs to be
> consumed by machines.

But what if I need some information to debug a problem in the future? I log
related information, and when an error happens, I can look at the logs and
find it. This might help to debug the problem. Ideally I would like to have
the values of all variables at any moment in time. In practice I can't have
that, but I can record the values which, I think, are most important. So
later, when I debug a problem, I'll have more information. And if I never
need it, no problem: disks are cheap and logs are small (if done properly;
there's nothing worse than searching for a needle in a 100GB log file).

For my programs I developed a simple set of rules. Log everything important
with DEBUG level (but it shouldn't be too much; don't log 10MB of data for
each HTTP request). Log everything that happens in the system at a "higher"
level with INFO level (like "add record with text {}"). Log everything
unusual with WARN level (like a failed user validation). Log every unexpected
but recoverable error (the application can still serve other requests) with
ERROR level. Log unrecoverable errors with FATAL level and shut down the
application. Works fine for me. And the TRACE level is just for debugging;
it's usually turned off, so it may produce any amount of data.
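
For what it's worth, a minimal sketch of those rules with a leveled logger
(logrus is used here purely as an example; any leveled logging library reads
much the same way):

    package main

    import (
        log "github.com/sirupsen/logrus"
    )

    func main() {
        // DEBUG and below stay off in production; flip this to see them.
        log.SetLevel(log.InfoLevel)

        log.Debug("parsed request body")               // important internals
        log.WithField("text", "hi").Info("add record") // higher-level events
        log.Warn("user validation failed")             // unusual but expected
        log.Error("write failed; request aborted")     // recoverable error
        log.Fatal("database unreachable")              // logs, then exits
    }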

~~~
allengeorge
I too thought that his logging recommendations were weak. Simply logging an
error (even with context) doesn't tell you much about how you entered that
state.

I follow these guidelines (similar to yours):

    - ERROR: Failed assertions, unrecoverable conditions
    - WARN: Exceptions (request is toast, but system works)
    - INFO: Important actions (entered a certain state)
    - DEBUG: Information associated with the actions above

Note that we too don't log full requests, etc. at DEBUG - that's way, way too
much. If you really want to do that, use TRACE, but that's usually only
useful to the developers of those libraries. We're also lucky in that we use
logback, so we can change the log level of the system at runtime very easily.

In production we usually run on INFO and above and the logs aren't
overwhelming - though we're constantly trying to improve the quality and
consistency of the messages.

------
henrik_w
I have found tracing to be very useful when troubleshooting problems. Tracing
(or session-based logging) lets you activate detailed logging for specific
customers. The system I worked on processed SMS messages for mobiles. There
you could enable tracing on a specific mobile phone number, and whenever that
subscriber sent an SMS, you got more detailed logs for everything affecting
that SMS. Really a great tool. I've described it in more detail here:
[http://henrikwarne.com/2014/01/21/session-based-logging/](http://henrikwarne.com/2014/01/21/session-based-logging/)
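
A minimal sketch of the idea, with a hypothetical trace list and handler:
keep a set of traced identifiers and raise log verbosity only for matching
sessions.

    package main

    import (
        "fmt"
        "sync"
    )

    // tracedSubscribers holds phone numbers with tracing enabled.
    // In a real system this would live in shared config or a database.
    var tracedSubscribers = struct {
        sync.RWMutex
        set map[string]bool
    }{set: map[string]bool{"+46700000001": true}}

    func traced(msisdn string) bool {
        tracedSubscribers.RLock()
        defer tracedSubscribers.RUnlock()
        return tracedSubscribers.set[msisdn]
    }

    // handleSMS logs at normal verbosity for everyone, and in detail
    // for subscribers on the trace list.
    func handleSMS(msisdn, body string) {
        if traced(msisdn) {
            fmt.Printf("TRACE [%s] raw body=%q\n", msisdn, body)
        }
        fmt.Printf("INFO  [%s] sms accepted\n", msisdn)
    }

    func main() {
        handleSMS("+46700000001", "hello") // traced: detailed logs
        handleSMS("+46700000002", "hi")    // not traced: normal logs
    }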

~~~
ktRolster
Your point is especially true when a bug is difficult to reproduce. In those
cases, good logging can mean the difference between a few hours and a few
months to solve a problem.

------
jorangreef
1\. Logs should contain as much information as possible in order to debug. If
you have to ask a user to reproduce a bug so that you can insert extra logging
statements, then I think you're doing it wrong. You want to be able to go
straight to your logs and have enough information to find the problem
immediately, without further delay. For example, how are you going to debug a
syncing error in a file sync app if you're not keeping detailed logs of
scans, inotify events, incoming and outgoing transactions, etc.?

2\. You almost want your log entries to tell a story; you want to spend nearly
as much time on your log entries as on your code. I think the more you get
this right, the better your software design will be.

3\. You also want to be able to grep your logs easily using a transaction and
user ID (see the sketch after this list). For example, if you just fetched a
batch of emails over IMAP, you want a transaction ID describing the session,
included in all related log entries for that session. You also want to
include an account ID in all related entries so you can see all fetches for a
given user.

4\. Your log entries should do as much work for you as possible. For example,
if a log entry includes a filename or string that contains unusual Unicode
characters, then you might want your log function to also record what Unicode
normalization form the string is in (NFC, NFD, or mixed). This can make
otherwise complex bugs trivial to solve.

5\. Logs should also be distributed, in the sense that you don't ship all your
logs to a central location, but rather have each server manage its own logs.
If you need to query your logs, then send a query out to the relevant servers
and get them to do the work for you. If logs are always a fraction of a
server's data output then you can easily manage gigabytes of log data per
server this way.
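
Here is a minimal sketch of points 3 and 4 (the field names and helper are
made up for illustration): tag each entry with transaction and account IDs so
sessions grep cleanly, and have the log function classify the Unicode
normalization form of filenames as it writes them.

    package main

    import (
        "fmt"

        "golang.org/x/text/unicode/norm"
    )

    // normForm reports how a string is normalized, so unusual
    // filenames explain themselves in the logs.
    func normForm(s string) string {
        nfc, nfd := norm.NFC.IsNormalString(s), norm.NFD.IsNormalString(s)
        switch {
        case nfc && nfd:
            return "NFC+NFD" // plain ASCII, normal either way
        case nfc:
            return "NFC"
        case nfd:
            return "NFD"
        default:
            return "mixed"
        }
    }

    // logFetch writes one greppable line per event, keyed by
    // transaction and account IDs.
    func logFetch(txnID, accountID, filename string) {
        fmt.Printf("txn=%s account=%s file=%q norm=%s\n",
            txnID, accountID, filename, normForm(filename))
    }

    func main() {
        logFetch("imap-7f3a", "user-42", "re\u0301sume\u0301.txt") // decomposed (NFD)
        logFetch("imap-7f3a", "user-42", "résumé.txt")             // precomposed (NFC)
    }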

~~~
sagichmal
> Logs should contain as much information as possible in order to debug.

There's a huge operational and engineering cost buried in that statement,
which the article is trying to unearth and provide better solutions for.

> 5\. Logs should also be distributed, in the sense that you don't ship all
> your logs to a central location, but rather have each server manage its own
> logs.

Though it does give you some nice properties, this runs counter to a lot of
accepted best practice... is there a good distributed log query system? `dsh`
doesn't cut it, for the record :)

~~~
jorangreef
> There's a huge operational and engineering cost buried in that statement,
> which the article is trying to unearth and provide better solutions for.

Sure, there's always an operational and engineering cost to logging. I think
the original comment makes that plain.

------
kordless
I know something of logs.

Logging is figuring out how to keep anything and everything from a small to
mid-sized time window, in the hope (but without really knowing) that some of
it may be useful later.

Monitoring is having something watch instrumentation (metrics) for anomalies.
This may include views you build up over time on the previously mentioned
logs.

If there were ever an industry that could benefit from AI, it's logging.

------
wslh
My two cents: if you need to instrument native Windows applications, including
Win32 and COM, my company develops three mature GPLv3 libs:

1/ Deviare In-Proc (like Microsoft Detours):
[https://github.com/nektra/Deviare-InProc](https://github.com/nektra/Deviare-InProc)

2/ Deviare Hooking Engine (even novices can instrument apps):
[https://github.com/nektra/Deviare2](https://github.com/nektra/Deviare2)

3/ RemoteBridge (access internal COM and Java apps):
[https://github.com/nektra/RemoteBridge](https://github.com/nektra/RemoteBridge)

All these libs include examples, and there are additional use cases on our
blog: [http://blog.nektra.com](http://blog.nektra.com)

------
akeating
I think the author raises some valuable points but does not emphasize the key
point about logging: that _logging has relatively limited applicability_. If
you think of logging that way, you log everything, but keep only a time-bound
window of log data. Around two weeks is plenty, because most of the time you
need to access the detailed log information the same or the next day. I think
careful attention should be paid to annotating log messages in various parts
of the stack with some piece of data that can be used to correlate the
messages that matter to your use case. E.g. make sure a customerId or
invoiceId or something logs through various levels of the stack.
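
A minimal sketch of that kind of annotation (the context key and helpers are
hypothetical): attach the customerId once at the edge of the stack, and have
every layer's logging include it.

    package main

    import (
        "context"
        "fmt"
    )

    type ctxKey string

    const customerKey ctxKey = "customerId"

    // logf prefixes every message with whatever correlation ID the
    // context carries, so messages from all layers grep together.
    func logf(ctx context.Context, format string, args ...interface{}) {
        id, _ := ctx.Value(customerKey).(string)
        fmt.Printf("customerId=%s "+format+"\n",
            append([]interface{}{id}, args...)...)
    }

    func chargeInvoice(ctx context.Context, invoiceID string) {
        logf(ctx, "charging invoice=%s", invoiceID) // deep in the stack
    }

    func handleRequest(customerID string) {
        ctx := context.WithValue(context.Background(), customerKey, customerID)
        logf(ctx, "request received") // at the edge
        chargeInvoice(ctx, "inv-1001")
    }

    func main() {
        handleRequest("cust-42")
    }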

------
pascalmahe
Interesting!

I must confess, I'm currently trying to get hard data on how our users are
using our product, and the thought of logging every request had crossed my
mind - as an entry point to finding the right solution, and also because our
product is hosted on the client's server and is rarely under heavy load (we
never have more than a dozen users connected at the same time).

I'm now going to look into instrumentation...

------
tptacek
Is anyone else here using Prometheus? I'd be interested in hearing about
experiences with it.

~~~
catwell
I am also interested since I saw the lightning talk about it at FOSDEM
(video: [http://mirror.onet.pl/pub/mirrors/video.fosdem.org/2016/h2215/prometheus-a-next-generation-monitoring-system.mp4](http://mirror.onet.pl/pub/mirrors/video.fosdem.org/2016/h2215/prometheus-a-next-generation-monitoring-system.mp4)).

For what it's worth, FOSDEM itself uses it for the real time monitoring of its
infrastructure.

------
u14408885
The author talks about Graphite and Prometheus as instrumentation systems,
but they're for collecting data. What do people recommend as a linkable
library for recording metrics inside the service process, before they're
collected? A quick Google search turned up 'go-metrics' and 'dropwizard
metrics', but the code I'm working on will require linking against a native C
library.

~~~
tptacek
The author is a Go developer; if you are too, a good starting point is to use
expvar and then whatever tool your collector provides for collecting expvars.
Besides being in the standard library, it has the advantage that lots of
tools speak expvar.
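
For reference, a minimal expvar sketch (the counter name is arbitrary);
importing expvar registers a handler that serves everything published this
way as JSON at /debug/vars:

    package main

    import (
        "expvar"
        "net/http"
    )

    // requests is published automatically under /debug/vars.
    var requests = expvar.NewInt("requests_total")

    func handler(w http.ResponseWriter, r *http.Request) {
        requests.Add(1) // instrument the hot path
        w.Write([]byte("ok"))
    }

    func main() {
        http.HandleFunc("/", handler)
        // expvar's init registers /debug/vars on http.DefaultServeMux.
        http.ListenAndServe(":8080", nil)
    }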

~~~
u14408885
Thanks for that. 'expvar' looks like a great hidden gem for Go development.
Unfortunately most of our services aren't written in Go, so we might have to
write our own if there isn't a handy library we could write bindings for.

