
Good Logging - henrik_w
https://henrikwarne.com/2020/07/23/good-logging/
======
perryizgr8
I feel that logging is a seriously hard problem and even experienced
programmers struggle with it. It is almost impossible to anticipate what
information will be useful later. So the first problem is to figure out the
right logs at the right places.

The second problem is to figure out whether logging is affecting performance, or whether a couple of logs can overpower everything else in some specific edge case. This needs thoughtful rate limiting and good knowledge of the use cases your system will be subjected to.

The third problem is journald. Those journal logs are the worst. The viewer, journalctl, is probably the worst log-viewing app in existence. These log files also have the unique quality of being of no use outside the system they were generated on. There is no way to open them on Windows, and even on Linux the times are all wrong. I don't know if that is a problem with journalctl or with my system.

~~~
BenoitP
> So the first problem is to figure out the right logs at the right places.

Over the years I have come to the following rule: during the development design-code-test loop, if the code does not work the first time, it gets a logging line where it failed. That helps with debugging right now, and over time it builds up a map of the code's sharp edges. You don't know what you don't know, so you might as well have information gathering around the entropy-generating places.

Other than that:

* Don't log in high-throughput loops (test them extra!); if you must, use a rate-limiting logger with an expiring delay (a sketch follows this list).

* Log important parameters

* Log assumptions. Especially about data. There's always way more entropy in data.

* This applies to user-reported bugs, which are a form of failed assumption. All information necessary to debug the issue must now go to the log.
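
A minimal sketch of such a rate-limiting logger in Python (the class, the per-key bookkeeping, and the 5-second interval are my own illustration, not a particular library's API):

    import logging
    import time

    class RateLimitedLogger:
        # Emit at most one instance of each message key per `interval`
        # seconds, and report how many similar messages were suppressed.
        def __init__(self, logger, interval=5.0):
            self.logger = logger
            self.interval = interval
            self.state = {}  # message key -> (last emit time, suppressed count)

        def log(self, level, key, msg, *args):
            now = time.monotonic()
            stamp, suppressed = self.state.get(key, (0.0, 0))
            if now - stamp >= self.interval:
                if suppressed:
                    self.logger.log(level, "(%d similar messages suppressed)", suppressed)
                self.logger.log(level, msg, *args)
                self.state[key] = (now, 0)
            else:
                self.state[key] = (stamp, suppressed + 1)

    # In a hot loop this emits at most one line (plus a summary) per 5 s.
    logging.basicConfig(level=logging.INFO)
    log = RateLimitedLogger(logging.getLogger("hotpath"))
    for i in range(1000000):
        log.log(logging.INFO, "progress", "processed %d items", i)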

Also end-to-end testing >> unit testing

~~~
MaxBarraclough
> Don't log in high throughput loops

Looks deceptively simple. Typically a function implements some behaviour or
computation, and hopefully does so correctly and efficiently. Its usage is,
well, up to the user. With the requirement that we don't log in high-throughput loops, this is no longer the case: we also have to anticipate usage patterns.

To put that another way, you might not know at the time whether your function
is going to be invoked from a high-throughput loop.

Is there a nice solution to this problem?

~~~
maest
Sounds like you don't have enough information to make a decision when developing the function, so the logging behaviour should be configurable by the user.

~~~
MaxBarraclough
Yes, I suppose a simple parameter would do the job. Probably not worth
anything fancy.

~~~
the_af
A parameter where? Obviously not in the code of the function itself: polluting your application code with flags such as log=true|false is a no-go. Especially since logging is only one such "cross-cutting concern", one of many, and you simply cannot go polluting your code with all of them; it leads to a mess.

Many Java logging systems allow you to enable/disable logging per package, but
this is still not enough for a per function decision.

Logging is a hard problem.

~~~
Spivak
I mean there's no reason you couldn't have a thread-local override to the log
level and restore it after the function call.

The fact that logging is at all intermixed with business logic I think is the
main problem. I should be able to take a fully working logless program and
apply different logging profiles to it. What gets logged should be a function
of the person invoking the program.
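
A stdlib-Python sketch of that thread-local override (the filter and the context manager are invented names, not a standard API):

    import logging
    import threading
    from contextlib import contextmanager

    _local = threading.local()

    class ThreadLocalLevelFilter(logging.Filter):
        # Drop records below the current thread's override, if one is set.
        def filter(self, record):
            override = getattr(_local, "level", None)
            return override is None or record.levelno >= override

    @contextmanager
    def log_level(level):
        # Temporarily raise the effective log level for this thread only.
        previous = getattr(_local, "level", None)
        _local.level = level
        try:
            yield
        finally:
            _local.level = previous

    logging.basicConfig(level=logging.DEBUG)
    for handler in logging.getLogger().handlers:
        handler.addFilter(ThreadLocalLevelFilter())

    # The caller decides: silence the chatty function inside a hot loop.
    with log_level(logging.WARNING):
        for i in range(1000):
            logging.debug("processing item %d", i)  # dropped at the handler

Note the record is still created on each call, so this saves the formatting and I/O, not the call itself.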

~~~
the_af
Definitely. I was trying to argue something about intermixing logging with app
logic :) It indeed seems to me the decision about what to log should be the
caller's, not the callee's, and so it shouldn't be hardcoded within the
latter.

------
falcolas
My favorite logging trick that walks the line between “not too much” and “not
too little”: a ring buffer.

Put highly detailed log messages into an in-memory ring buffer, and when you encounter an error, log out the entire contents of the ring buffer. That way you get detailed information when things go wrong and an inspectable-while-debugging record of actions, all while not overwhelming your log aggregation system with useless information.
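
A minimal sketch of this with Python's stdlib logging (the handler name and capacity are mine; stdlib's logging.handlers.MemoryHandler is similar, but flushes when full rather than discarding the oldest records):

    import logging
    from collections import deque

    class RingBufferHandler(logging.Handler):
        # Keep the last `capacity` records in memory; dump them all to
        # `target` as soon as a record at `flush_level` or above arrives.
        def __init__(self, target, capacity=200, flush_level=logging.ERROR):
            super().__init__(level=logging.DEBUG)
            self.target = target
            self.flush_level = flush_level
            self.buffer = deque(maxlen=capacity)  # oldest records fall off

        def emit(self, record):
            self.buffer.append(record)
            if record.levelno >= self.flush_level:
                for buffered in self.buffer:
                    self.target.handle(buffered)
                self.buffer.clear()

    log = logging.getLogger("app")
    log.setLevel(logging.DEBUG)
    log.addHandler(RingBufferHandler(logging.StreamHandler()))

    log.debug("detailed step 1")  # buffered, not written anywhere yet
    log.error("boom")             # dumps the buffered detail plus the error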

~~~
quietbritishjim
One disadvantage of that - though it only matters in some contexts - is that
you pay the cost of formatting all those log messages (or at least copying
enough data to be able to format them later) even if they're never read.

~~~
falcolas
The cost of writing data to memory is incredibly small; you'd have to be working on something like a game engine for it to even be noticeable.

Logging can also be done in a separate thread, reducing even that
exceptionally minor cost.
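
Python's stdlib ships this separate-thread pattern directly as QueueHandler/QueueListener; not lock-free, but the hot path only pays for an in-memory enqueue:

    import logging
    import logging.handlers
    import queue

    # Records are handed to a background thread, which does the actual
    # formatting and I/O; the caller's thread only enqueues.
    q = queue.SimpleQueue()
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    root.addHandler(logging.handlers.QueueHandler(q))

    listener = logging.handlers.QueueListener(q, logging.StreamHandler())
    listener.start()

    logging.info("cheap on the calling thread")
    listener.stop()  # flushes the queue before the program exits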

~~~
wsargent
Can confirm -- writing data to memory is incredibly fast. It's even faster to use a lock-free queue with CAS to insert the data, though:

[http://psy-lob-saw.blogspot.com/2016/12/linked-array-queues-...](http://psy-lob-saw.blogspot.com/2016/12/linked-array-queues-part-2-spsc.html)

I've benchmarked code myself, and the flamegraphs will barely show the
SpscArrayQueue in the profile.

------
xorcist
The best advice is hidden in one of the links in the article: always include some sort of session identifier, so as to identify all the log messages that belong to a session or transaction.

The article speaks briefly about log severity levels, which are key to manageable logs. Make sure to specify what levels the application uses, preferably in a design document if one is being used.

High severity errors should be reserved for situations that should not normally occur, or for problems internal to the organization. For example, an end user entering invalid credentials is not a high severity error, but invalid credentials towards the backing data store is.

Personally I tend to keep log severity thresholds local to the session, because those are the kinds of filters I normally want to write.

~~~
agumonkey
it's as if we should do logging modeling. There's a need for structure mapping logging to the code's (sub)processes, and for relating traces to lifecycle.

~~~
wsargent
This is what Honeycomb and Charity Majors talk about when they get into
observability. They talk about events rather than logs, and discuss how to
construct a meaningful event from a series of data points [1].

Having a good dictionary of logging fields is something that is coming together more [2], although some companies will pre-validate log messages to ensure they meet given conditions, like Palantir's Witchcraft [3].

[1] [https://www.honeycomb.io/blog/event-foo-a-series-of-unfortun...](https://www.honeycomb.io/blog/event-foo-a-series-of-unfortunate-incredible-events/)

[2] [https://github.com/vektra/logging-fields](https://github.com/vektra/logging-fields)

[3] [https://github.com/palantir/safe-logging](https://github.com/palantir/safe-logging)

~~~
agumonkey
thanks, very interesting

------
quietbritishjim
A variant of the "No Screaming" rule (which funnily enough mentions "The only
acceptable differentiator is to use different log levels, like WARNING and
INFO"):

Don't log at an unnecessarily high level. It's tempting to do this when you're working on a new feature, e.g. if everything above WARNING goes to the console, maybe you'll make your new feature log everything at WARNING. This is partly just for ease of debugging while you're developing, and partly (especially for junior devs) because it feels like your new feature is way more important than everyone else's, so it needs to be logged more loudly. The right way is to make it configurable per feature which logging levels go to the console, as in the sketch below.
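
In Python's stdlib logging, for instance, that per-feature configuration is just a per-logger level (logger names here are illustrative):

    import logging

    logging.basicConfig(level=logging.WARNING)  # default console threshold

    # The new feature gets a verbose console without inflating severities.
    logging.getLogger("app.new_feature").setLevel(logging.DEBUG)

    logging.getLogger("app.billing").info("hidden: below the threshold")
    logging.getLogger("app.new_feature").debug("shown: per-feature level")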

~~~
twic
I work on a small team which does its own operations. Anything logged at
WARNING goes to our production radiator. Anything logged at ERROR goes to our
production radiator and makes it play an annoying sound.

If you overuse WARNING or ERROR logs - as i initially did on joining the team
- you will rapidly get feedback from other team members to knock it off.

~~~
quietbritishjim
What is a radiator in this context? A bit of searching didn't turn up anything
except metal objects that exchange heat!

~~~
twic
Ah, it means an information radiator, which is a funky term for a big thing on
a wall that everyone can see:

[https://www.leadingagile.com/2017/11/information-radiators-i...](https://www.leadingagile.com/2017/11/information-radiators-information-vaults/)

In this case, it's a screen displaying monitoring information. And since we're
all working from home, it actually means a web page we all have open and keep
an eye on.

Daftly enough, we don't even call it the radiator, we call it the dashboard,
but i thought "radiator" might be a more understandable name!

------
danw1979
Amen to the idea of including host address and port in every log line about
connection success/failure.

It's a pet peeve of mine to read "Failed to connect to server" or suchlike in
an application log and then have to go capturing packet traces to find out
what it failed to connect to exactly.

------
lmilcin
I would add a couple more things:

* use some kind of context for the transaction that includes a unique id that is a common theme through the system (maybe a unique id originating from an external system). This is called a mapped diagnostic context in Log4j. It lets you grep all your logs for information about a single transaction. Even better if you do this consistently across all services (a sketch follows this list).

* offload metrics from logs to appropriate tool

* offload audit information from logs. I have a service that gathers all information of the type (service A called service B and got result C for transaction D), and a simple page that can find the information for transaction D and produce a sequence graph whose lanes are the services involved.

* do not log the same information twice. Don't both log an error and throw it (either you throw it, or you handle it and log it). It is ok to log _additional_ information when passing an error along that would otherwise be lost.
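
A stdlib-Python sketch of the first point, stamping a transaction id onto every log line via a filter (the field names and the contextvars approach are my illustration; in Java-land this is what Log4j's MDC gives you):

    import logging
    from contextvars import ContextVar

    transaction_id = ContextVar("transaction_id", default="-")

    class TransactionFilter(logging.Filter):
        # Stamp every record with the current transaction id, so all lines
        # belonging to one transaction can be grepped out of combined logs.
        def filter(self, record):
            record.txid = transaction_id.get()
            return True

    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        "%(asctime)s [txid=%(txid)s] %(levelname)s %(message)s"))
    handler.addFilter(TransactionFilter())
    logging.basicConfig(level=logging.INFO, handlers=[handler])

    # Set the id once at the system boundary; every line below inherits it.
    transaction_id.set("ext-4711")
    logging.info("payment accepted")  # [txid=ext-4711] INFO payment accepted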

~~~
skissane
> use some kind of context for the transaction to include unique id that is
> common theme through the system (maybe unique id originating from external
> system)

Oracle has this concept called "ECID" (Execution Context ID), where the first component in the system to get an HTTP request assigns it an ID, and that ID then gets passed down through the system as a custom HTTP header. Since an HTTP request can produce sub-requests (e.g. talking to backend micro-services), the ECID becomes hierarchical, with sub-request identifiers appended to the ECID of the original request. The ECID can even get passed through JDBC into the RDBMS, so you can correlate database logs with logs at the higher layers.
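
Not Oracle's actual mechanism, just an illustrative sketch of the hierarchical-id idea (the header name is invented):

    import uuid

    def root_ecid():
        # The first component to see the request mints the root id.
        return uuid.uuid4().hex

    def sub_ecid(parent, index):
        # Each sub-request appends its own component to the parent's id.
        return "%s-%d" % (parent, index)

    ecid = root_ecid()
    for i, backend in enumerate(["inventory", "billing"]):
        headers = {"X-ECID": sub_ecid(ecid, i)}  # passed to the sub-request
        print(backend, headers)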

> do not log an information twice. Don't log an error and throw the error
> (either you throw it or handle and log)

Unfortunately, my personal experience is you can't always trust your caller to
log exceptions properly. (Especially when you are just working on one
component in some huge system over which you lack overall control.) Throw and
log can add noise to the logs (the same exceptions gets logged two or more
times), but it also reduces the risk that the exception just disappears and
never gets logged at all. (Or gets logged incompletely – don't get me started
about Java apps and libraries that rethrow exceptions without setting the
cause, meaning you lose information on the root cause of the exception.)

~~~
lmilcin
1\. The ID should ideally come from an external system, so that it is always possible to locate the records by an ID known to the user. Say I am in support and want to figure out what happened: the user gives me the ID, and I can use it to find the logs for that unique transaction, even if the user never got any response.

2\. Well, it is ok to log the error when it crosses component boundaries; this is basically what you do when you log an error before returning it from your service to another service. Ideally there would be some coding guidelines that enforce fairly uniform error logging. We enforce this in code review.

------
gbuk2013
And if you are building anything other than toy systems then please, for the
love of the FSM, use structured logging!

If I could get my hands on the developers who wrote all the snowflake log
formats I had to deal with on a daily basis back when I was doing support and
ops …

~~~
mamcx
I tried this, but the only way to "use" them is to pay absurd amounts of money for cloud tools, or to set up complex Grafana + ... things.

Using them from the command line? Reading them right where they are? Nothing exists that is as simple as "nano server.log".

~~~
rob-olmos
If the logs are stored as JSON, the jq tool can be used. I understand it's more complicated than grep for certain structures, but having a library of jq filters for the app may be useful, and they can be shared.
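
And if jq isn't at hand, the same kind of filter is a few lines of Python over JSON-lines logs (the field names are illustrative):

    import json
    import sys

    # Roughly `jq -r 'select(.level == "ERROR") | .msg'` over stdin.
    for line in sys.stdin:
        if line.strip():
            event = json.loads(line)
            if event.get("level") == "ERROR":
                print(event.get("msg"))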

~~~
mamcx
But that requires learning extra stuff that is not part of the normal flow: locating that part of the log, or scanning it quickly.

------
politelemon
That's a good guide to follow, I appreciate it. The logging-too-much syndrome is what I notice the most, and I think it comes from a tendency to treat a log file like a debugging session, e.g. console.log(). A log file is a tool to help the future hapless developer (could be you) troubleshoot a problem with the system.

So it's a matter of putting yourself in someone else's shoes, but the shoes
are a future situation that you have to imagine.

------
semicolonandson
Good logging for me involves:

\- aggregation from multiple systems

\- one line per event (so I can use unix tools)

\- (ideally) enough information to replay some action in the event of a bug

\- being able to live-tail it within seconds

\- putting a unique, grepable identifier next to each printed output

------
Chris2048
> GREP-ABLE MESSAGES

seems to imply text-based logs parsed via some regex variant. Even better are structured logs, parsed by field explicitly.

------
XorNot
My one principle for logging is that if you think your code needs a comment to
explain it, you probably actually need to be putting whatever you're writing
in a debug logging statement.

It's easy to remove that line later. It's a lot harder for ops to figure out
what the hell is going on if it's not working.

~~~
mumblemumble
2019-08-20 14:00:03.270 INFO Implements Ukkonen's suffix tree algorithm. For
more on how it works, see
[https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10....](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.751)
or get a PDF from our file share at
\\\some\path\to\departmental\share\FileThatWasDeletedThreeYearsAgo.pdf

2019-08-20 14:00:03.856 INFO Implements Ukkonen's suffix tree algorithm. For
more on how it works, see
[https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10....](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.751)
or get a PDF from our file share at
\\\some\path\to\departmental\share\FileThatWasDeletedThreeYearsAgo.pdf

2019-08-20 14:00:04.137 INFO Implements Ukkonen's suffix tree algorithm. For
more on how it works, see
[https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10....](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.751)
or get a PDF from our file share at
\\\some\path\to\departmental\share\FileThatWasDeletedThreeYearsAgo.pdf

2019-08-20 14:00:04.523 INFO Implements Ukkonen's suffix tree algorithm. For
more on how it works, see
[https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10....](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.751)
or get a PDF from our file share at
\\\some\path\to\departmental\share\FileThatWasDeletedThreeYearsAgo.pdf

Something tells me that my logging under that heuristic wouldn't fly with
sysops.

~~~
XorNot
"Probably" aka use some common sense.

People frequently write business code like this:

    # Setup the main data
    doSomeProcessWithData(data)

    # Do the other process
    doTheNextProcess(data)

    # Check if thing X due to some requirement
    if (some check happens):
        # Do the workaround
        <some hack which currently works>

    # Finish up
    dispatchItToZ()

And provided this works, okay, no problem. But the reality is that this is going to get deployed, hit some edge case the requirements didn't cover, and start failing. Written like this, someone is going to be staring at the output wondering where it's failing.

If, on the other hand, each of those comments were a debug-level log line, as they should be, then at minimum the logs now have a mode where they document exactly what happens and why (though a business process should be logged at INFO, IMO).

~~~
mumblemumble
Well, originally it was just a joke. But since we're going there:

Needing to log at quite that level of detail worries me. It implies that non-deterministic behavior is pretty pervasive in the codebase. I'm generally a lot happier when the logging strategy on the happy path looks like:

1\. Get all your I/O-type stuff handled before the call stack gets deep.

2\. Log your inputs

3\. Run the calculation

4\. Log your outputs

If what happens in step 3 is deterministic, then it should always be possible
to explain the output by looking at the input, and reading the code. And, if
you've accomplished that, then it's straightforward to convert the logs into a
test case.
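
A toy sketch of that strategy (the function and its fields are invented for illustration):

    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("calc")

    def price_order(order, fx_rates):
        # Log at the boundary only: the two lines below, plus the code,
        # are enough to reproduce (and write a test for) any output.
        log.info("price_order inputs: order=%r fx_rates=%r", order, fx_rates)
        total = sum(qty * price * fx_rates[ccy] for qty, price, ccy in order)
        log.info("price_order output: total=%r", total)
        return total

    price_order([(2, 9.99, "EUR")], {"EUR": 1.1})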

------
alkonaut
Logging strategy for single-user apps (desktop and mobile) is probably very different from that for a multi-user app such as a web service, since in the latter things can interleave and the volume is much larger. With a web service I'm usually not as interested in what the user did fifteen minutes earlier, because the nature of the web means the relevant information is kept on one end or the other (a db, a request). But in a client-side app you will often want an _activity log_ that outlines what the user did, and that's not what naturally follows from a normal "subsystem log" where different pieces of code just log what they are up to.

------
cjfd
It can also help a lot if you log all incoming/outgoing packets in a format
that allows replaying the session.

------
gmurthy_
I've found a lot of value in canonical logging[0] for server-side logging.

Basically, we emit one log line per request/response with the appropriate severity level (INFO if the request succeeded, ERROR if it failed), and that line contains all the interesting information we care to log. This makes it very easy to track and debug issues at the individual-request level, without having to rely heavily on external tools to parse and combine logs to reconstruct requests and sessions.

The other benefit of this is that we can feed these logs directly into log
ingestion tools like Datadog and plot graphs, view distributions and set
alerts on metrics present in logs.

[0] [https://stripe.com/blog/canonical-log-lines](https://stripe.com/blog/canonical-log-lines)
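
A minimal sketch of the pattern (the fields and helper are my own, not Stripe's):

    import json
    import logging
    import time

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log = logging.getLogger("canonical")

    def do_work(request):
        return 200  # stand-in for the real handler logic

    def handle(request):
        # Accumulate everything interesting in one dict, then emit a single
        # wide log line per request, at a severity matching the outcome.
        line = {"method": request["method"], "path": request["path"]}
        start = time.monotonic()
        try:
            line["status"] = do_work(request)
            level = logging.INFO
        except Exception as exc:
            line["status"], line["error"] = 500, repr(exc)
            level = logging.ERROR
        line["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        log.log(level, json.dumps(line))

    handle({"method": "GET", "path": "/health"})
    # {"method": "GET", "path": "/health", "status": 200, "duration_ms": 0.0}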

~~~
3pt14159
From the linked article:

> in addition to their normal log traces

Does this mean that normal logging happens outside the final log line? If so,
then that sounds great. Otherwise I'd worry about losing state due to a bug.

------
kostarelo
I'm going through this article, and it seems to me that it suggests very low-level techniques. Grep-able messages? Like ssh-ing into a machine and running grep to see your logs?

I would go for a centralized logging system.

If you are working on a PaaS, it probably has its own logging feature already built in.

If you are looking to self-host, give the ELK stack a try (many alternatives exist as well).

If you are looking for a third-party service, look at Loggly (many alternatives exist as well).

And the most important one is: export structured JSON. A centralized logging
system will parse and index your logs.

In the ELK case, logs will be indexed in Elasticsearch. From there you will be able not only to filter and search, but also to create monitors and alerts on your logs.

~~~
kyuudou
Strongly agree with this comment. ELK, and now EFK using Fluentd, are fairly mature, and the dashboards used to view the data (e.g. Kibana, Grafana) are full of useful features that get you past the "endless series of greps" method of manually parsing through logs. Another benefit of offloading app, system and security logging to a central logging server is removing a potential single point of failure: simply running out of storage. Elasticsearch clusters are also easy to build out to scale.

------
jillesvangurp
It's a topic where people default to doing more or less counterproductive things because they are not clear on their goals.

Basically, if you have some kind of cloud-based server infrastructure and you do logging, the goal is to get the logs aggregated in a place where you can do something productive with them. We are talking about a time series database where it is all about the metadata and less about the message. The message is still important, but only if you can query it out of millions of other, possibly very similar, messages without too much hassle.

Fallacy #1 is believing that a log event is a string. It's not. You might believe it is a string, format it as a string, perhaps even in some semi-structured way, and use tools that feed your confirmation bias by treating logs as lines of text. But in the end, something is going to pick that string apart to extract some metadata, just so the event can be inserted into a database and aggregated in the right place. Half the success is not losing the data that matters through wacky formatting, using the wrong tools for the job, etc.

Of course, this picking apart of strings is a lot easier if you use some kind of standardized log event format. Which one is appropriate probably depends on where you aggregate your logs. The trick here is to avoid building assumptions about this into your software, and to avoid prematurely discarding information that upstream aggregations might benefit from. A lot happens between where you invoke some info or warn function in your code and where a message gets stored in some log event aggregator.

Some of the better logging frameworks typically have these features: log levels, an API for logging, an external way of configuring the output of the log engine (formatting, coloring, filtering things out), and some notion of attaching metadata to the events.

The latter is one of those things that is super useful but that relatively few developers even know is possible. If you are unfamiliar with the notion of a Mapped Diagnostic Context (MDC): this is where you put any structured information related to sessions, thread ids, and other context. Preserving this information all the way to your log aggregator is the goal.

~~~
wsargent
Agree with everything you've said, but MDC does have one big drawback: it's
only convenient when you're working within a single thread. If you're writing
asynchronous code (Scala, Akka, Vert.x etc) then MDC doesn't help you very
much. [1]

I generally find it easier to have an explicit "context" object and pass that
around in an operation, and then log once that context has been built up. If
you're lucky, the logging framework will do that for you.

Bunyan did this the Right Way from the get go, using `log.child`. [2]

[1] [https://tersesystems.github.io/blindsight/usage/context.html...](https://tersesystems.github.io/blindsight/usage/context.html#mapped-diagnostic-context)

[2] [https://github.com/trentm/node-bunyan#logchild](https://github.com/trentm/node-bunyan#logchild)
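
In stdlib-Python terms, the closest analogue to that explicit context object is logging.LoggerAdapter (a sketch; the names are invented):

    import logging

    logging.basicConfig(level=logging.INFO,
                        format="%(levelname)s request_id=%(request_id)s %(message)s")
    base = logging.getLogger("app")

    def handle_request(request_id):
        # Build the context once at the boundary and pass `log` around;
        # every line it emits carries the id, like Bunyan's log.child.
        log = logging.LoggerAdapter(base, {"request_id": request_id})
        log.info("request started")
        do_work(log)
        log.info("request finished")

    def do_work(log):
        log.info("step one done")  # inherits request_id from the caller

    handle_request("req-42")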

~~~
jillesvangurp
With Kotlin, there is a way to make the MDC work with coroutines. Basically, it works by leveraging the coroutine scope to keep track of a context. This ships as part of the coroutines library, in the form of extension functions on the slf4j API: [https://kotlin.github.io/kotlinx.coroutines/kotlinx-coroutin...](https://kotlin.github.io/kotlinx.coroutines/kotlinx-coroutines-slf4j/)

------
smileypete
Maybe worth adding that for lower-level software it's well worth baking logging in from the start; nothing worse than trying to add logging to a non-functional ball of mud ;-(

------
cstuder
Any recommendations for simple centralized logging for small/hobbyist/non-profit projects?

I'm basically looking for something self-hosted or cloud-based where I can
easily POST some data from Python/PHP/JavaScript/Bash running on two or three
servers and easily have a look at the logs streaming in.

Spinning up an entire Graylog seems overkill in order to log a couple of dozen
lines per day.

~~~
phaer
[https://grafana.com/docs/loki/latest/](https://grafana.com/docs/loki/latest/)

is pretty good for structured logging. But as always, it depends on your needs, especially with regard to indexing, as Loki stores only the raw log data and the labels you set.

Putting new data into it via POST is pretty simple: [https://grafana.com/docs/loki/latest/api/#post-lokiapiv1push](https://grafana.com/docs/loki/latest/api/#post-lokiapiv1push)

------
Yetanfou
Here's my background in forestry shining through: when I read the title I thought the article was about a good way to harvest trees. Maybe it is a sign of the times that I was a bit disappointed to see the article was instead about process data collection, something which is part of modern logging (in the forest) too, but far from its most interesting or challenging aspect.

------
oftenwrong
Here's my interpretation of log levels:

TRACE = low-level details

DEBUG = details

INFO = good to know

WARN = might want to look into this

ERROR = definitely look into this

FATAL = why it died

------
hanfsi
My tip for good logging: log in JSON.

Just log directly to JSON; it makes any later processing so much easier.
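
A minimal stdlib-Python sketch (the formatter name and field choice are mine; in practice a library like python-json-logger or structlog does this for you):

    import json
    import logging

    class JsonFormatter(logging.Formatter):
        # Render each record as one JSON object per line.
        def format(self, record):
            return json.dumps({
                "ts": self.formatTime(record),
                "level": record.levelname,
                "logger": record.name,
                "msg": record.getMessage(),
            })

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=logging.INFO, handlers=[handler])

    logging.info("user %s logged in", "alice")
    # {"ts": "...", "level": "INFO", "logger": "root", "msg": "user alice logged in"}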

