The second problem is to figure out if logging is affecting performance. Or if a couple of logs can overpower everything else in a specific edge case. This will need thoughtful rate limiting and a good knowledge of the use cases your system will be subjected to.
Third problem is journald. Those journal logs are the worst. The viewer journalctl is probably the worst log viewing app that exists. These log files also have the unique quality of not being of any use outside the system they were generated on. No way to open them in Windows. Even on Linux, the times are all wrong. I don't know if that is a problem with journalctl or my system.
Over the years I have come to the following rule: during the development design-code-test loop, if the code does not work the first time, it gets a logging line where it failed. It helps with debugging right now, and it makes a sharp edge surface, by some definition. You don't know what you don't know, so you might as well have information gathering around entropy-generating places.
Other than that:
* Don't log in high-throughput loops (test them extra!); if you must, use a rate-limiting logger with an expiring delay (see the sketch after this list).
* Log important parameters
* Log assumptions. Especially about data. There's always way more entropy in data.
* This applies to user-reported bugs, which are a form of failed assumption. All information necessary to debug the issue must now go to the log.
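For the rate-limiting bullet above, here's a minimal sketch in Java of what I mean (RateLimitedLogger is a made-up name, and a real version would also evict expired keys so the map stays bounded):

    import java.util.concurrent.ConcurrentHashMap;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Sketch: log a given key at most once per time window.
    final class RateLimitedLogger {
        private static final Logger logger = LoggerFactory.getLogger(RateLimitedLogger.class);
        private final ConcurrentHashMap<String, Long> lastLogged = new ConcurrentHashMap<>();
        private final long windowNanos;

        RateLimitedLogger(long windowNanos) {
            this.windowNanos = windowNanos;
        }

        void info(String key, String message, Object... args) {
            long now = System.nanoTime();
            Long prev = lastLogged.get(key);
            if (prev == null || now - prev >= windowNanos) {
                // Best effort: a race here may log twice, which is fine for rate limiting.
                lastLogged.put(key, now);
                logger.info(message, args);
            }
        }
    }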
Also end-to-end testing >> unit testing
Looks deceptively simple. Typically a function implements some behaviour or computation, and hopefully does so correctly and efficiently. Its usage is, well, up to the user. With the requirement that we don't log in high-throughput loops, this is no longer the case: we also have to anticipate usage patterns.
To put that another way, you might not know at the time whether your function is going to be invoked from a high-throughput loop.
Is there a nice solution to this problem?
> Is there a nice solution to this problem?
There are two ways to do it from within the logging framework. The first is to suppress duplicate messages with a filter, like the DuplicateMessageFilter. It wouldn't be that hard to add a filter that looks for the message template and suppresses based on other properties of the LoggingEvent, such as the request ID or the overall similarity of the event.
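For example, a rough sketch of such a filter (this is not Logback's actual DuplicateMessageFilter; the threshold handling is simplified and would need a reset mechanism):

    import java.util.concurrent.ConcurrentHashMap;
    import org.slf4j.Marker;
    import ch.qos.logback.classic.Level;
    import ch.qos.logback.classic.Logger;
    import ch.qos.logback.classic.turbo.TurboFilter;
    import ch.qos.logback.core.spi.FilterReply;

    // Sketch: a TurboFilter keyed on the message template rather than the formatted message.
    public class TemplateSuppressionFilter extends TurboFilter {
        private final ConcurrentHashMap<String, Integer> repetitions = new ConcurrentHashMap<>();
        private final int allowedRepetitions = 5; // would be a configurable property in real life

        @Override
        public FilterReply decide(Marker marker, Logger logger, Level level,
                                  String format, Object[] params, Throwable t) {
            if (format == null) {
                return FilterReply.NEUTRAL;
            }
            // Key on the template; a fancier version could also key on a request ID from the MDC.
            int count = repetitions.merge(format, 1, Integer::sum);
            return count > allowedRepetitions ? FilterReply.DENY : FilterReply.NEUTRAL;
        }
    }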
The second thing you can do is add a budget for your logger, which will only allow it to log a certain amount in a given window. This is something you can do with budget evaluators.
Depending on the layer you put over SLF4J, you can also add conditional logging, which puts the logic at the application layer. Under this model, you'd have a counter or circuit breaker in the condition that would short-circuit the logging.
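A bare-bones version of that condition idea might look like this (ConditionalLogger is a made-up name; Blindsight's actual API differs):

    import java.util.function.Supplier;
    import org.slf4j.Logger;

    // Sketch: the condition short-circuits before the logging framework is ever touched.
    final class ConditionalLogger {
        private final Logger underlying;
        private final Supplier<Boolean> condition;

        ConditionalLogger(Logger underlying, Supplier<Boolean> condition) {
            this.underlying = underlying;
            this.condition = condition;
        }

        void info(String message, Object... args) {
            if (condition.get()) {
                underlying.info(message, args);
            }
        }
    }

    // Usage: allow the first 1000 events, then go quiet -- a crude circuit breaker.
    //   AtomicLong budget = new AtomicLong(1000);
    //   ConditionalLogger log = new ConditionalLogger(
    //       LoggerFactory.getLogger("app"), () -> budget.decrementAndGet() >= 0);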
Meanwhile, the general case is actually that actions are performed on collections of things, although this is often not obvious due to OOP design. But when working with collections it becomes obvious that log messages should be aggregates over collections ("2 out of N items failed, errors: ..."). This produces orders of magnitude less logging. It's just not mainstream yet.
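A sketch of what that looks like in Java (Item and the per-item work are hypothetical):

    import java.util.ArrayList;
    import java.util.List;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class BatchProcessor {
        private static final Logger logger = LoggerFactory.getLogger(BatchProcessor.class);

        record Item(String id) {} // hypothetical payload

        void process(Item item) { /* hypothetical per-item work */ }

        // One aggregate log line for the whole collection instead of one line per item.
        void processAll(List<Item> items) {
            List<String> errors = new ArrayList<>();
            for (Item item : items) {
                try {
                    process(item);
                } catch (RuntimeException e) {
                    errors.add(item.id() + ": " + e.getMessage());
                }
            }
            if (!errors.isEmpty()) {
                logger.warn("{} out of {} items failed, errors: {}",
                        errors.size(), items.size(), errors);
            }
        }
    }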
Many Java logging systems allow you to enable/disable logging per package, but this is still not enough for a per function decision.
Logging is a hard problem.
I have something for that. 
The fact that logging is intermixed with business logic at all is, I think, the main problem. I should be able to take a fully working logless program and apply different logging profiles to it. What gets logged should be a function of the person invoking the program.
I've been fighting this fight for a decade now, and I'm pretty sure I've never won it for a whole team. The best I got was that people would pre-populate bug fix code reviews with reasons why they couldn't sensibly unit test this particular issue. If you have had better success, please tell me your secrets!
Hmm, any examples of this?
> Hmm, any examples of this?
Or you can use logstash-logback-encoder, which has masking.
Completely disagree there. I've had quite a pleasant experience with journalctl. It has sane arguments and you can easily grep any relevant log messages from the journal. You can add custom fields to your log messages. You can even get the logs in JSON format to parse log messages from any scripts. And you can even use the cursor feature of journalctl to keep reading the logs at regular intervals without having to keep a constantly running process and tail the logs.
I can understand people not liking systemd-journald on account of the binary logs, but I definitely don't understand calling journalctl the worst log viewing app.
Sadly $dayjob runs servers on linux and isn't going to merge the dtrace patches in there, and python's systemtap probes are… limited.
Apparently bcc can hook into USDT probes though, I should look into that: https://ish-ar.io/python_ebpf_tracing/
If you want something more advanced, you can look at libstapsdt, which enables JVM applications to create USDT probes.
Regarding BPF, the book is a must-have for anyone doing serious work with it.
If such hardware isn't being used, then yes, sanity should be questioned for supporting Uncle Larry's Linux.
Performance isn’t much of a problem though. Have different logging levels and only turn on a high level when you’re trying to nail down a particular STR (steps to reproduce).
Spoiler -- it totally can. Kirk Pepperdine talks about logging as a memory allocation bottleneck:
So I gave my usual logging rant at a workshop that I gave. I was in the Netherlands about a year or so ago. And that night, they went and stripped all the logging out of their transactional monitoring framework that they’re using, which wasn’t getting them the performance they wanted, which is why I was there giving the workshop in the first place. And when they stripped out all their logging, the throughput jumped by a factor of four. And they’re going like, “Oh, so now our performance problem not only went away, we’ve been tuning this thing so long that actually, when they got rid of the real problem, it went much faster than they needed, like twice as fast as what they needed to go.” – The Trouble with Memory
Unfortunately it can be kind of hard to track the memory allocation rate over time, and it's typically not the sort of thing you're focused on. I put together an audio queue that will play the memory allocation rate as a sine wave, so I can tell when it's getting to be a problem.
Put highly detailed log messages into an in-memory ring buffer, and when you encounter an error, log out the entire contents of the ring buffer. That way you get detailed information when things go wrong, an inspectable-while-debugging record of actions, all while not overwhelming your log aggregation system with useless information.
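A minimal in-application sketch of that pattern (a real version sits in the logging framework, as the replies below discuss):

    import java.util.ArrayDeque;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Sketch: keep detailed messages in a bounded ring buffer; dump it all on error.
    final class RingBufferLog {
        private static final Logger logger = LoggerFactory.getLogger(RingBufferLog.class);
        private final ArrayDeque<String> buffer = new ArrayDeque<>();
        private final int capacity;

        RingBufferLog(int capacity) {
            this.capacity = capacity;
        }

        synchronized void trace(String message) {
            if (buffer.size() == capacity) {
                buffer.removeFirst(); // overwrite the oldest entry, ring-buffer style
            }
            buffer.addLast(message);
        }

        synchronized void error(String message, Throwable t) {
            // On failure, flush the detailed history that led up to the error.
            for (String line : buffer) {
                logger.error("[replay] {}", line);
            }
            buffer.clear();
            logger.error(message, t);
        }
    }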
Debugging is also the primary use case for logging program internals.
Analytics should be done on non-internal logs (such as traffic logs), unless you're just logging everything to the same destination without any structure, which is probably a bad idea.
Analytics instead should be a separate stream in most cases.
If you're placing the ringbuffer inside an appender, then the appender's encoder will format it. However, if the ringbuffer itself is an appender, then you can store the logging events as-is until it's time to dump them -- this is the approach in terse-logback.
Another option is to put the ringbuffer BEFORE sending it to the logging framework, so it's never processed by SLF4J at all. This means that it's more accessible to the application, at the cost of losing the initial timestamp of the logging event. This is the approach taken in Blindsight. 
Shout out to Bunyan, which is (AFAIK) the only server-side logging framework that has ringbuffers out of the box.
Logging can also be done in a separate thread, reducing even that exceptionally minor cost.
I've benchmarked code myself, and the flamegraphs will barely show the SpscArrayQueue in the profile.
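A sketch of what that queue-based handoff can look like, using JCTools' SpscArrayQueue (a single producing thread is assumed, per the SPSC contract; this is not necessarily the exact setup benchmarked above):

    import org.jctools.queues.SpscArrayQueue;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Sketch: hand log events to a background thread via a single-producer queue.
    final class AsyncLog {
        private static final Logger logger = LoggerFactory.getLogger(AsyncLog.class);
        private final SpscArrayQueue<String> queue = new SpscArrayQueue<>(65536);

        AsyncLog() {
            Thread consumer = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    String msg = queue.poll();
                    if (msg != null) {
                        logger.info(msg);
                    } else {
                        Thread.onSpinWait(); // a real version would park instead of spinning
                    }
                }
            }, "async-logger");
            consumer.setDaemon(true);
            consumer.start();
        }

        // Must be called from a single producer thread (SPSC contract).
        void log(String message) {
            if (!queue.offer(message)) {
                // Queue full: drop (or count drops) rather than block the hot path.
            }
        }
    }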
Any examples you can provide?
A Google search for "logging to a ring buffer" will provide more.
The article speaks briefly on log severity levels, which are key to manageable logs. Make sure to specify which levels the application uses, preferably in a design document if one is being used.
For example, high-severity error messages should be reserved for situations that should not normally occur, or for problems internal to the organization. An end user's invalid credentials are not a high-severity error, but invalid credentials towards the backing data store are.
Personally I tend to keep log severity thresholds local to the session, because that's the kind of filters I normally want to write.
Having a good dictionary of logging fields is something that is coming together more, although some companies will pre-validate log messages to ensure they meet given conditions, like Palantir's Witchcraft.
Don't log at an unnecessarily high level. It's tempting to do this when you're working on a new feature e.g. if everything above WARNING goes to the console maybe you'll make your new feature all WARNING. This is partly just for ease of debugging while you're developing, and partly (especially for junior devs) because it feels like your new feature is way more important than everyone else's so it needs to be logged more loudly. The right way to do this is to make it configurable per-feature what logging levels go to the console.
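If your backend is Logback, here's a crude programmatic approximation of that (the feature's package name is hypothetical, and normally you'd do this in logback.xml rather than in code):

    import ch.qos.logback.classic.Level;
    import ch.qos.logback.classic.Logger;
    import org.slf4j.LoggerFactory;

    public class PerFeatureLevels {
        public static void main(String[] args) {
            // Root stays at INFO so the console isn't flooded...
            Logger root = (Logger) LoggerFactory.getLogger(org.slf4j.Logger.ROOT_LOGGER_NAME);
            root.setLevel(Level.INFO);

            // ...while the feature you're actively developing gets DEBUG on its own logger.
            Logger feature = (Logger) LoggerFactory.getLogger("com.example.newfeature");
            feature.setLevel(Level.DEBUG);
        }
    }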
The rationale was that knowing that every single log statement absolutely would be written to the log would encourage developers to think more carefully about their logging. From what I saw, it worked.
If you overuse WARNING or ERROR logs - as I initially did on joining the team - you will rapidly get feedback from other team members to knock it off.
In this case, it's a screen displaying monitoring information. And since we're all working from home, it actually means a web page we all have open and keep an eye on.
Daftly enough, we don't even call it the radiator, we call it the dashboard, but I thought "radiator" might be a more understandable name!
It's a pet peeve of mine to read "Failed to connect to server" or suchlike in an application log and then have to go capturing packet traces to find out what it failed to connect to exactly.
* use some kind of context for the transaction that includes a unique ID which is a common theme through the system (maybe a unique ID originating from an external system). This is called the mapped diagnostic context (MDC) in Log4j. It lets you grep all your logs for information about a single transaction. Even better if you do this consistently across all services (see the MDC sketch after this list).
* offload metrics from logs to an appropriate tool
* offload audit information from logs. I have a service that gathers all information of the type (service A called service B and got result C for transaction D), and a simple page that can find the information for transaction D and produce a sequence graph where the lanes are the services involved.
* do not log information twice. Don't log an error and throw it (either you throw it, or you handle and log it). It is OK to log additional information when passing an error along, if that information would otherwise be lost.
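Here's the MDC sketch referenced above, using SLF4J (the txId field name is made up):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;

    public class TransactionHandler {
        private static final Logger logger = LoggerFactory.getLogger(TransactionHandler.class);

        void handle(String transactionId) {
            MDC.put("txId", transactionId); // every log line on this thread now carries txId
            try {
                logger.info("processing started");
                // ... business logic; code called from here logs normally and still gets txId
                logger.info("processing finished");
            } finally {
                MDC.remove("txId"); // MDC is thread-local, so always clean up
            }
        }
    }

With %X{txId} in the pattern layout, the ID then shows up on every line logged inside that scope.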
Oracle has this concept called "ECID" (Execution Context ID), where the first component in the system to get an HTTP request assigns it an ID, and then that gets passed down as a custom HTTP header through the system. Since an HTTP request can produce sub-requests (e.g. talking to backend micro-services), the ECID becomes hierarchical, in that sub-request identifiers get appended to the ECID of the original request. The ECID can even get passed through JDBC into the RDBMS, so you can correlate database logs with logs at the higher layers.
> do not log information twice. Don't log an error and throw it (either you throw it, or you handle and log it)
Unfortunately, my personal experience is you can't always trust your caller to log exceptions properly. (Especially when you are just working on one component in some huge system over which you lack overall control.) Throw and log can add noise to the logs (the same exceptions gets logged two or more times), but it also reduces the risk that the exception just disappears and never gets logged at all. (Or gets logged incompletely – don't get me started about Java apps and libraries that rethrow exceptions without setting the cause, meaning you lose information on the root cause of the exception.)
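For anyone who hasn't been bitten by this yet, the difference is a single constructor argument (ConfigException and parseConfig here are hypothetical):

    import java.io.IOException;

    public class CausePreservation {
        // Hypothetical exception type with the standard (message, cause) constructor.
        static class ConfigException extends RuntimeException {
            ConfigException(String message) { super(message); }
            ConfigException(String message, Throwable cause) { super(message, cause); }
        }

        static void parseConfig() throws IOException { /* hypothetical */ }

        static void loadBad() {
            try {
                parseConfig();
            } catch (IOException e) {
                throw new ConfigException("could not load config");    // root cause is lost
            }
        }

        static void loadGood() {
            try {
                parseConfig();
            } catch (IOException e) {
                throw new ConfigException("could not load config", e); // cause survives into the log
            }
        }
    }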
2. Well, it is OK to log the error when it crosses component boundaries; this is basically what you do when you log an error before returning it from your service to another service. Ideally, there would be coding guidelines enforcing rather uniform error logging. We enforce this in code review.
I know this to be 100% true. I deal with it at my dayjob. However, I will always take the hard-line stance that the caller needs to be fixed in these cases.
I tend to add a unique ID, which is usually just the first 7 characters from any hash function. Something like
    date | md5sum | cut -c1-7
I was debugging a lot of lambda functions and this really ended up saving a lot of time as at runtime that was about the only information you could get.
One other thing I found useful was to use JSON format for logs. I know there are many against this, but I found being able to add additional params without breaking existing tooling was a massive win, plus it formatted nicely in CloudTrail. Later on I realised most of our logs were processed by humans second and computers first, so having a machine-readable format was a real benefit. Additionally, we were using Scalyr at the time, so each new value in the JSON became an additional filter, which was a very nice outcome.
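On the JVM, this is roughly what that looks like with logstash-logback-encoder (assuming the encoder is configured; kv is its StructuredArguments helper, and the field names here are made up):

    import static net.logstash.logback.argument.StructuredArguments.kv;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class OrderService {
        private static final Logger logger = LoggerFactory.getLogger(OrderService.class);

        void complete(String orderId, long durationMs) {
            // Renders as readable text on the console, and as extra JSON fields
            // ("orderId": "...", "durationMs": ...) in the encoded output.
            logger.info("order processed: {} in {}",
                    kv("orderId", orderId), kv("durationMs", durationMs));
        }
    }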
If I could get my hands on the developers who wrote all the snowflake log formats I had to deal with on a daily basis back when I was doing support and ops …
I cannot emphasise enough how much better it is for searching the logs and running any kind of automation off of the logs (analytics, monitoring, alerting).
"Noisy" logs are often cited as a major problem, but well-structured logging allows you to filter out the noise very easily, and focus on only the log events that are relevant to your task.
Using them from the command line? Reading them right there? Nothing is as simple as "nano server.log".
And if you are building something serious and with some sort of scale, then you will need something like an ELK stack anyway to aggregate and display the data - there is no getting away from that.
Working with unstructured logs, however, is a mess. Perl/Awk helps, of course, but it is a lot of extra effort. And in the end, at scale, you will just be writing those scripts to turn that mess into structured data to feed into that same ELK stack anyway.
So it's a matter of putting yourself in someone else's shoes, but the shoes are a future situation that you have to imagine.
- aggregation from multiple systems
- one-line per event (so I can use unix tools)
- (ideally) enough information to replay some action in the event of a bug
- being able to live-tail it within seconds
- putting a unique, grepable identifier next to each printed output
seems to imply text-based logs parsed via some regex variant. Even better are structured logs parsed by field explicitly.
It's easy to remove that line later. It's a lot harder for ops to figure out what the hell is going on if it's not working.
2019-08-20 14:00:03.856 INFO Implements Ukkonen's suffix tree algorithm. For more on how it works, see https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.... or get a PDF from our file share at \\some\path\to\departmental\share\FileThatWasDeletedThreeYearsAgo.pdf
2019-08-20 14:00:04.137 INFO Implements Ukkonen's suffix tree algorithm. For more on how it works, see https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.... or get a PDF from our file share at \\some\path\to\departmental\share\FileThatWasDeletedThreeYearsAgo.pdf
2019-08-20 14:00:04.523 INFO Implements Ukkonen's suffix tree algorithm. For more on how it works, see https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.... or get a PDF from our file share at \\some\path\to\departmental\share\FileThatWasDeletedThreeYearsAgo.pdf
Something tells me that my logging under that heuristic wouldn't fly with sysops.
People frequently write business code like this:
    # Setup the main data
    # Do the other process
    # Check if thing X due to some requirement
    if (some check happens):
        # Do the workaround
        <some hack which currently works>
    # Finish up
And provided this works, okay, no problem. But the reality is that this is going to get deployed, hit some edge case the requirements didn't cover, and start failing. Written like this, someone is probably looking at that output wondering where it's failing.
If, on the other hand, each of those comments were a debug-level log line, as they should be, then at minimum the logs now have a mode where they document exactly what happens and why (though of course a business process should be logged at Info, IMO).
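Roughly what that looks like with the comments promoted to log lines (a Java sketch; all names are hypothetical):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class BusinessProcess {
        private static final Logger logger = LoggerFactory.getLogger(BusinessProcess.class);

        record Request(String id) {} // hypothetical input

        void run(Request request) {
            logger.info("setting up the main data for {}", request.id());
            setupMainData(request);

            logger.info("running the other process");
            doOtherProcess(request);

            if (needsWorkaround(request)) {
                logger.debug("applying workaround for requirement X on {}", request.id());
                applyWorkaround(request);
            }

            logger.info("finishing up {}", request.id());
            finishUp(request);
        }

        // hypothetical helpers
        void setupMainData(Request r) {}
        void doOtherProcess(Request r) {}
        boolean needsWorkaround(Request r) { return false; }
        void applyWorkaround(Request r) {}
        void finishUp(Request r) {}
    }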
Needing to log at quite that level of detail worries me. It implies that non-deterministic behavior is pretty pervasive in the codebase. I'm generally a lot happier when the logging strategy on the happy path looks like:
1. Get all your I/O-type stuff handled before the call stack gets deep.
2. Log your inputs
3. Run the calculation
4. Log your outputs
If what happens in step 3 is deterministic, then it should always be possible to explain the output by looking at the input, and reading the code. And, if you've accomplished that, then it's straightforward to convert the logs into a test case.
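In sketch form, with a hypothetical, fully deterministic step 3:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class PriceCalculator {
        private static final Logger logger = LoggerFactory.getLogger(PriceCalculator.class);

        // I/O already happened in the caller; this layer is pure and deterministic.
        long quote(long basePriceCents, int quantity, int discountPercent) {
            logger.info("quote inputs: base={} qty={} discount={}",
                    basePriceCents, quantity, discountPercent);
            long total = basePriceCents * quantity * (100 - discountPercent) / 100;
            logger.info("quote output: total={}", total);
            return total;
        }
    }

Each logged input/output pair is then effectively a ready-made assertion, e.g. quote(1000, 3, 10) == 2700.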
Basically, we emit one log per request/response with appropriate severity level (INFO if request succeeded, ERROR if it failed) and the log contains all the interesting information that we care to log. This makes it very easy to track and debug issues at an individual request level, while not having to overly rely on external tools to parse/combine logs to reconstruct requests/sessions.
The other benefit of this is that we can feed these logs directly into log ingestion tools like Datadog and plot graphs, view distributions and set alerts on metrics present in logs.
> in addition to their normal log traces
Does this mean that normal logging happens outside the final log line? If so, then that sounds great. Otherwise I'd worry about losing state due to a bug.
I would go for a centralized logging system.
If you are working on a PaaS, it probably has its own logging feature already built in.
If you are looking to self-host, give the ELK stack a try (many alternatives as well).
If you are looking for a third-party service, look over at Loggly (many alternatives as well).
And the most important one is: export structured JSON. A centralized logging system will parse and index your logs.
In the ELK case, logs will be indexed in ElasticSearch. From there you will be able to not only filter and search but also create monitors and alerts on your logs.
Basically, if you have some kind of cloud based server infrastructure and you do logging, the goal is to get it aggregated in a place where you can do something productive with it. We are talking about a time series database where it is all about the metadata and less about the message. The message is still important but only if you can query it out of millions of other and possibly very similar messages without too much hassle.
Fallacy #1 is believing that a log event is a string. It's not. You might believe it is a string and format it as a string, and you might even have some semi-structured way of doing that, and tools that feed your confirmation bias by treating logs as lines of text. But in the end, something is going to be picking that string apart to extract some metadata just so it can be inserted in a database and aggregated to the right place. Half the success is not losing the data that matters through wacky formatting, using the wrong tools for the job, etc.
Of course, this picking apart of strings is a lot easier if you use some kind of standardized log event format. Which one is appropriate probably depends on where you aggregate your logs. The trick here is to avoid building assumptions about this into your software, and to avoid prematurely discarding information that upstream aggregations might benefit from. A lot happens between where you invoke some info or warn function in your code and where a message gets stored in some log event aggregator.
Some of the better logging frameworks typically have these features: log levels, an API for logging, an external way of configuring the output of the log engine (formatting, coloring, filtering things out), and some notion of attaching meta data to the events.
The latter is one of those things that are super useful but also one that relatively few developers even know is possible. If you are unfamiliar with the notion of a Mapped Diagnostic Context (MDC): this is where you put any structured information related to sessions, thread ids, and other context. Preserving this information all the way to your log aggregator is the goal.
I think we got some of this right in the standard Julia logging frontend. We allow the user to attach arbitrary key value pairs (name=data) to the event, and also automatically attach some statically known context such as source location.
(The loggers are also dynamically scoped to the task, which has several good properties. For example it gives great control over attaching context known by the caller to the callee's messages and over dynamically routing or filtering messages. Sort of like MDC but more convenient.)
I generally find it easier to have an explicit "context" object and pass that around in an operation, and then log once that context has been built up. If you're lucky, the logging framework will do that for you.
Bunyan did this the Right Way from the get go, using `log.child`. 
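A stripped-down sketch of such a context object (made-up names; a real one would emit structured fields rather than a map's toString):

    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Sketch: accumulate context across an operation, emit one log line at the end.
    final class LogContext {
        private static final Logger logger = LoggerFactory.getLogger(LogContext.class);
        private final Map<String, Object> fields = new LinkedHashMap<>();

        LogContext put(String key, Object value) {
            fields.put(key, value);
            return this; // chainable, so call sites stay terse
        }

        void emit(String message) {
            logger.info("{} {}", message, fields); // one aggregate line per operation
        }
    }

    // Usage:
    //   LogContext ctx = new LogContext().put("requestId", id);
    //   ... ctx.put("rows", n).put("durationMs", ms); ...
    //   ctx.emit("request handled");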
Spinning up an entire Graylog seems overkill in order to log a couple of dozen lines per day.
Have a syslog-ng running locally on all your hosts and forwarding logs to a central one. Applications can easily send logs to syslog, all languages/loggers support it.
The central syslog can listen for remote logs and save everything into /storage/hostname/year-month-date/file.log for example. This allows you to grep everything from that host.
It's simple and it uses very little resources.
Loki is pretty good for structured logging. But as always, it depends on your needs, especially with regard to indexing, as Loki stores only the raw log data and the labels you set.
But putting new data into it via POST is pretty simple
You can set up with one of their Beelines, or just send JSON to the REST endpoint.
TRACE = low-level details
DEBUG = details
INFO = good to know
WARN = might want to look into this
ERROR = definitely look into this
FATAL = why it died
Just log directly to JSON; it makes any processing later so much easier.