
Application-Level Logging Best Practices - geshan
https://geshan.com.np/blog/2019/03/follow-these-logging-best-practices-to-get-the-most-out-of-application-level-logging-slides/
======
gregmac
Something I wish this discussed more was development vs production logs. It
talks about a unique request identifier across microservices, but this applies
within a single service too.

In your development environment you're usually running only one request/job at
a time, so logging each step along the way makes it easy to follow and debug.
In production, with tens or even thousands of concurrent requests, this
quickly becomes basically impossible.

I've become a huge fan of fewer but "wider" lines. Take sending an email: I'm
sure most people have seen (or even written) logs like:

* New message from me@example.com to you@other.com

* Subject "test 123"

* Generating message using template xyz

* Checking sender against block lists

* Sender block list check passed

* Connecting to SMTP server smtp1.example.com..

* SMTP authentication succeeded, server responded "235 2.7.0 Authentication successful"

* Send succeeded: "250 ok"

This is ok when you're writing the code initially, but much better is a single
line (or single structured message):

* Outgoing mail message. from="me@example.com" to="you@other.com" subject="test 123" blocklist=passed template=xyz smtp_server=smtp1.example.com smtp_auth="235 2.7.0 Authentication successful" smtp_status="250 ok"

All the context needed (whether succeeded or not), compact, and concurrent
requests are a non-issue. I can't say I'm fully on the "structured log"
bandwagon just yet (mostly because I haven't found a tool I like enough to
give up the simplicity of a text editor), but I can see the appeal for sure.
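For what it's worth, the step-by-step log above collapses into a wide event quite naturally in most logging setups. A minimal sketch in Python, where the field names and the send pipeline are invented for illustration:

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("mail")

def send_mail(from_addr, to_addr, subject):
    # Accumulate context as each step runs, then emit ONE wide event at
    # the end (in the finally block, so failures are captured too).
    event = {"event": "outgoing_mail", "from": from_addr,
             "to": to_addr, "subject": subject}
    try:
        event["blocklist"] = "passed"               # stand-in for a real check
        event["smtp_server"] = "smtp1.example.com"  # stand-in for a real send
        event["smtp_status"] = "250 ok"
    finally:
        log.info(json.dumps(event))
    return event

send_mail("me@example.com", "you@other.com", "test 123")
```

One line per operation means grepping for a single id returns the whole story, even with thousands of concurrent sends in flight.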

~~~
sdrothrock
One thing I try to be careful of, especially in your particular example, is to
prefix log lines with session IDs. If my app is sending multiple e-mails
simultaneously, I want to know which lines correspond to which session if one
of them goes south, because they're all likely to be jumbled up.

This is doubly true for logging multithreaded processes.
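With Python's stdlib logging, for example, that kind of per-session prefix can be injected once with a filter rather than hand-written into every call (the session id here is a made-up example):

```python
import logging

class SessionFilter(logging.Filter):
    """Stamp every record with a session id so concurrent sessions can be
    untangled when their lines interleave."""
    def __init__(self, session_id):
        super().__init__()
        self.session_id = session_id

    def filter(self, record):
        record.session_id = self.session_id
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("[%(session_id)s] %(message)s"))
log = logging.getLogger("mailer")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.addFilter(SessionFilter("sess-1f2e"))  # hypothetical session id
log.info("SMTP authentication succeeded")  # prints: [sess-1f2e] SMTP ...
```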

~~~
griffordson
I'm now pushing application logging into a Kinesis stream that gets firehosed
to both S3 and Elasticsearch. This includes both request logs and logs of any
CRUD operation on my models (including all changed attributes with their
old/new values.) There are separate logs for each, but the CRUD events get
tagged with the ID of the request that triggered them so it is now trivial to
see everything that happened to a given record and which request caused it and
which user did it. And since all request logs have the session ID, I can also
easily see everything that user did during that session both before and after
the event in question. I've been wanting this sort of logging solution since
the early 2000s and I now have it and it is even more incredible than I
imagined.

~~~
acidbaseextract
Any blog posts or talks that inspired this? How do you handle retention? Any
libraries or tools that helped with it?

I can see the use of such logging clearly, but I work at a young startup and
implementing that ourselves sounds like infrastructure overkill. Maybe I'm
overestimating the work?

Thanks for any thoughts!

~~~
griffordson
Sorry, just saw your reply. I suspect you are overestimating the work. I just
added some Rack middleware to handle request logging, then plugged into the
Rails model callbacks to record the CRUD events into a different Kinesis
stream. We're a 3-dev shop and I did this on my own over a few weeks, part-
time.

As far as retention goes, I just dump it to S3 for permanent storage.
Kinesis Firehose makes that dead simple. I also push each event into AWS-
hosted Elasticsearch and could theoretically purge older data, but for our
volume of data, I'll likely retain it there for years.

Edit: Also, I generate a UUID in the Rack middleware and set a thread-local
storage variable that I can pull out in the CRUD logging, so I can always
see which Rack request caused the CRUD model event.
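A Python analogue of that middleware + thread-local trick, using contextvars (all names here are illustrative, not the poster's actual code):

```python
import logging
import uuid
from contextvars import ContextVar

logging.basicConfig(level=logging.INFO)

# One variable per execution context; works across threads and async tasks.
request_id: ContextVar[str] = ContextVar("request_id", default="-")

def handle_request(environ):
    # "Middleware" layer: mint one UUID per incoming request.
    request_id.set(str(uuid.uuid4()))
    save_record({"name": "example"})   # pretend model write

def save_record(attrs):
    # "Model callback" layer: tag the CRUD event with the originating
    # request without threading the id through every call signature.
    logging.getLogger("crud").info(
        "crud_event request_id=%s attrs=%r", request_id.get(), attrs)

handle_request({})
```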

~~~
acidbaseextract
Sounds like I did way overestimate. Thank you for the structure!!

------
sebcat
Another best practice: Please do not use logging as an excuse to avoid
implementing proper error propagation in libraries or utility code.

Bad:

      void something_failable(void *data) {
        if (!is_valid(data)) {
          log_err("invalid data");
        }

        /* ... */
      }

Less bad:

      int something_failable(xxx_t *ctx, void *data) {
        if (!is_valid(data)) {
          return xxx_err(ctx, "invalid data");
        }

        /* ... */
      }

~~~
ensiferum
Another best practice: don't write logic to deal with BUGS; assert instead.

Bad:

      int something_failable(xxx_t *ctx, void *data) {
        if (!is_valid(data)) {
          return xxx_err(ctx, "invalid data");
        }

        /* ... */
      }

Less bad:

      int something_failable(xxx_t *ctx, void *data) {
        ASSERT(is_valid(data));

        /* ... */
      }

~~~
dkersten
Maybe "is_valid" is checking the validity of externally-provided/third-party
data; that is, it being invalid is a problem, but it's not a bug.

------
dmitryminkovsky
Missing is probably the most important thing about logging: be vigilant
and never log sensitive user information or secret values like password reset
tokens. There have been quite a few cases of compromised secondary/tertiary
logging infrastructure leading to exploits of primary systems.

~~~
sethammons
"Be vigilant" seems to be the mindset, but I can't help feeling that what's
needed is a system that protects against accidentally logging secrets. Given
the varied ways people log, what they log, and when, it is likely unsolvable
in the general sense.

We have structured logs (objects serialized as JSON that automatically get
default fields like app name, timestamp, and such). These structures can have
fields tagged as sensitive, and those values are redacted. This is still
prone to human error.
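A stripped-down sketch of that tagging idea (the field names are invented for illustration):

```python
import json

SENSITIVE = {"password", "reset_token", "ssn"}  # fields tagged as sensitive

def redact(event: dict) -> dict:
    """Return a copy of a structured log event with tagged fields masked.
    Only helps when the secret lands in a known field -- hence still
    prone to human error."""
    return {k: ("[REDACTED]" if k in SENSITIVE else v)
            for k, v in event.items()}

event = {"user": "alice", "password": "hunter2", "action": "login"}
print(json.dumps(redact(event)))
# → {"user": "alice", "password": "[REDACTED]", "action": "login"}
```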

~~~
Radim
> …a system that protects from accidentally logging secrets…

That ecosystem exists. There are software tools, both on-prem and (less
interestingly) cloud/SaaS, that sit in the middle and look for personal and
sensitive data. You plug into their API, and then get to decide what to do
when such information is detected: ignore, quarantine, redact, alert…

(Disclaimer: I built one such tool, [https://pii-tools.com](https://pii-
tools.com))

~~~
sethammons
I'll look into this, thanks! I'm not sure how you will know hunter2 is a
password, but interested to find out :)

------
godot
Tangential: With the recent screw-up of Facebook logging user passwords in
cleartext in (what I presume to be) this type of logging, I thought (before
clicking) that an article named "Application-Level Logging Best Practices"
would say something about stripping out passwords etc. when logging anything
that involves user inputs. In fact, I thought it was what prompted this post.
:)

~~~
objectified
In case you are still interested in reading about the security aspect of
logging, there is the OWASP Logging Cheat Sheet dedicated to this:

[https://github.com/OWASP/CheatSheetSeries/blob/master/cheats...](https://github.com/OWASP/CheatSheetSeries/blob/master/cheatsheets/Logging_Cheat_Sheet.md)

~~~
stefant
Thx for sharing, interesting.

------
nurettin
I would like to stress the importance of buffering and thread safety in your
logging functions. A logger with buffering will far outperform opening a
file, writing, and closing for every message, which is the default for many
loggers.

[http://fredpuls.com/site/softwaredevelopment/java/log4j/log4...](http://fredpuls.com/site/softwaredevelopment/java/log4j/log4j_performance_tips.htm)
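In Python, for instance, the stdlib covers both properties with one pattern: a QueueHandler in the hot path and a QueueListener thread that owns the slow I/O (the file name is invented):

```python
import logging
import logging.handlers
import queue

# Hot threads drop records on an in-process queue (cheap, thread-safe);
# a single background listener thread does the actual file I/O.
q = queue.Queue(-1)                     # unbounded
file_handler = logging.FileHandler("app.log")
listener = logging.handlers.QueueListener(q, file_handler)
listener.start()

log = logging.getLogger("worker")
log.addHandler(logging.handlers.QueueHandler(q))
log.setLevel(logging.INFO)
log.info("processed job %d", 42)

listener.stop()   # drains the queue before shutdown
```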

~~~
taneq
Buffering is great as long as you can rely on the buffer getting flushed to
disk in the event that the application crashes. Even more so if you're doing
some kind of embedded development (which I realise is not the usual bread-and-
butter here) where your system can trip a circuit breaker or be reset by a
watchdog timer.

~~~
richardwhiuk
One tactic is to keep the log buffer in RAM so you can dump it from the core
file. You can often have a RAM-only buffer that holds lower-level (more
verbose) logs than what you write to disk.
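For an in-process (rather than dump-from-core) version of the same idea, Python's stdlib MemoryHandler buffers verbose records in RAM and only writes them to the real target when something severe shows up; here the "disk" is a StringIO stand-in to keep the sketch self-contained:

```python
import io
import logging
import logging.handlers

sink = io.StringIO()                      # stand-in for the on-disk log
buffered = logging.handlers.MemoryHandler(
    capacity=1000,                        # buffer size before forced flush
    flushLevel=logging.ERROR,             # severity that triggers a dump
    target=logging.StreamHandler(sink))

log = logging.getLogger("device")
log.setLevel(logging.DEBUG)
log.addHandler(buffered)

log.debug("state A")     # held in RAM, not written
log.debug("state B")     # held in RAM
log.error("watchdog!")   # flushes the buffered context plus this record
```

Unlike the core-file trick, this buffer is gone on a hard crash, which is exactly the caveat about unflushed buffers raised above.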

~~~
tonyarkles
I absolutely love this approach and have used it successfully in the past, but
there's a caveat: I have only encountered a handful of people in my career who
know or care what a core file is, and even fewer who know how to extract data
out of one.

If you're building something that's going to primarily have strong systems
people working on it, it's a great approach, especially for ephemeral state
logs and debugging. If the "general public" is going to be trying to extract
logs out of it, it's probably not very ergonomic for them.

Edit: taking it a level deeper, this is used in the Linux kernel during the
early stages of booting. When I was doing bring-up on a board, I wasn't
getting any console output when trying to boot the kernel. What I could do was
let it sit for a bit, reset, and dump the kernel message buffer from U-Boot to
see what the kernel had been doing. Absolutely beautiful backdoor when trying
to debug broken serial output.

~~~
anitil
There's an article called something like "The logging system you can't have"
which talks about using a similar thing. In the case of a panic, the device
would restart and find the buffer in memory, allowing you to recover crash
data across boots.

From memory, they then pointed all logs at /dev/kmsg rather than
/var/log/syslog or an equivalent logger, so that all userspace logs were
buffered immediately into memory. I think the system logger then read from
the buffer? Wish I could find the article.

------
StanAngeloff
...and make sure logging code is safe, e.g., no NullPointerExceptions,
OutOfBoundsExceptions, etc. (Dynamic languages make this more of a challenge.)
I've seen too many systems where customers were affected because a logging
routine failed.
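One concrete way to keep such failures inside the logger, in Python: pass values as arguments instead of formatting them eagerly. The stdlib formats lazily inside the handler and routes formatting errors to stderr rather than back into the request (the Flaky class is a contrived stand-in):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("orders")

class Flaky:
    def __repr__(self):
        raise RuntimeError("broken repr")

# Eager formatting runs in the business code path, so this line would
# raise right here and take the request down with it:
#   log.info(f"got {Flaky()!r}")

# Deferred formatting happens inside the logging machinery, whose
# handleError hook reports the problem to stderr instead of raising:
log.info("got %r", Flaky())
print("request still alive")
```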

------
jdwyah
Being able to change log levels on the fly with specific granularity is a
game changer, imo:
[https://www.prefab.cloud/documentation/logging](https://www.prefab.cloud/documentation/logging)

Otherwise your logs are always either too verbose or not there when you need
them.
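The cheap local version of this, with Python's hierarchical loggers (the names are invented): flip one namespace to DEBUG at runtime and leave everything else quiet.

```python
import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger("app").setLevel(logging.INFO)

db_log = logging.getLogger("app.db")          # inherits INFO from "app"
db_log.debug("query plan: seq scan")          # suppressed

# Flip just the db namespace at runtime -- e.g. from an admin endpoint
# or a config watcher -- without touching anything else:
logging.getLogger("app.db").setLevel(logging.DEBUG)
db_log.debug("query plan: seq scan")          # now emitted
logging.getLogger("app.http").debug("...")    # still suppressed at INFO
```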

------
tomerbd
* No logging.

In a massive environment you find that logging is almost needless and metrics
take over. Then, if you have an issue, you take a recording of events and
replay it in a dev environment instead of checking logs.

~~~
sethammons
Metrics are great. In general, I subscribe to "don't log successes" (with the
caveats such as log events tied to billing or that get passed as messages to
other services).

But what if your problem only occurs at scale? You won't be able to replicate
that in your dev env. Error logs and event logs make sense in many (most?)
services.

------
vkaku
The deal with good logging is that good structured code makes debugging with
logging easier.

The unsaid rule: if you're doing async and just suddenly printing a bunch of
messages, logging will not help you. :)

~~~
marcosdumay
Log synchronously. Either exploit POSIX line buffering, add explicit
synchronization to your IO routine (it doesn't need to stop the business
code), or log each thread to a different stream.
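A sketch of the "explicit synchronization" option in Python (the function names are invented): one lock-protected write per complete line, so concurrent threads can interleave lines but never split one.

```python
import sys
import threading

_log_lock = threading.Lock()

def log_line(msg, stream=sys.stderr):
    # The lock covers exactly one whole-line write; business code only
    # pauses for the duration of the write itself.
    with _log_lock:
        stream.write(msg + "\n")

def worker(n, stream):
    for i in range(100):
        log_line("thread=%d step=%d" % (n, i), stream)

threads = [threading.Thread(target=worker, args=(n, sys.stderr))
           for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```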

------
mberning
site seems down, but off the top of my head

log rotation

log aggregation

log errors/exceptions at the error level, and nothing else

make judicious use of trace, info, debug, etc. levels

properly namespace loggers so that the log level can be set appropriately
depending on the environment and current needs

get the fundamentals right - if you are logging an exception, do you actually
pass the exception to the logger? Same goes for things like failed API calls.
Don't just log "API call failed". Log the entire response including status
code and messages.
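Those last two fundamentals can be sketched in Python like so (the client and endpoint are hypothetical):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("payments")

def charge(client, amount):
    try:
        resp = client.post("/charge", {"amount": amount})
    except Exception:
        # Pass the exception itself: log.exception attaches the full
        # traceback instead of a bare "API call failed".
        log.exception("charge request failed amount=%s", amount)
        raise
    if resp.status_code >= 400:
        # Log the entire response -- status code and body -- not just
        # the fact that it failed.
        log.error("charge rejected status=%s body=%s",
                  resp.status_code, resp.text)
    return resp
```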

~~~
pojzon
It's missing a part about high-frequency logging.

Something like "log level checks should take only a single clock cycle".

This looks more like best practices for a dev environment or for low-traffic
apps.
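Python's level check isn't literally one clock cycle, but the principle translates: make the disabled case nearly free by guarding any expensive argument construction (the names below are illustrative):

```python
import logging

log = logging.getLogger("hot")
log.setLevel(logging.INFO)

def expensive_dump(state):
    # Stand-in for something costly: serializing a large object, say.
    return repr(state)

def hot_path(state):
    # The guard is a couple of integer compares; expensive_dump only
    # runs when DEBUG is actually enabled.
    if log.isEnabledFor(logging.DEBUG):
        log.debug("state: %s", expensive_dump(state))

hot_path({"queue_depth": 10000})   # no-op at INFO level
```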

~~~
YorkshireSeason
Would you be able to give (link to) information about best practises for high-
traffic apps?

~~~
stefant
Pls check [https://www.nearform.com/blog/pino-the-fastest-node-js-
logge...](https://www.nearform.com/blog/pino-the-fastest-node-js-logger-for-
production/)

