> And similar for things like DEBUG or TRACE. Usually you reach for those log levels when you want to see extremely verbose information, but... That's not practical to enable in production environments, because the volume of logs you produce would be way too high. And if you can't use it in production, you probably shouldn't use it elsewhere! That's what your local debugger is for.
That's why you want DEBUG and TRACE. I'd rather have useful logs as well as a debugger when running locally.
I agree. I found the argument against DEBUG and TRACE to be particularly weird, since it's very rare that you enable these log levels in production code.
While debugging and testing locally, though? They're extremely useful.
I think they're also useful in production code, although much more rarely. I've had a few cases where a customer was hitting a strange problem that the more verbose log levels helped resolve quickly.
Being able to have them enable the verbose levels, reproduce the issue, drop the log level back to normal, and send you the logs can occasionally surface an issue that would have taken forever to track down otherwise.
And that's not even to mention special processing. In the unices, anyway, the log daemon can take different actions depending on what level the log message was issued at. CRITICAL errors can result in automatic notification to a dev, for instance.
I've used DEBUG-level tracing in a production system by capturing it into a circular buffer. If you hit an error you dump it; otherwise it's thrown away. It gives you wonderful visibility just when you need it: which is when the system fails. The difficulty is the runtime cost. While capturing the raw data into a circular buffer is pretty cheap, formatting it the way the logging system needs is expensive. The workaround is to defer the formatting until you need to spit it out. Most logging systems don't let you defer formatting, so you end up having to roll your own. It's worthwhile to do that IMO.
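For anyone who wants to try this without rolling everything from scratch: Python's stdlib logging already defers message formatting until a handler actually emits a record, and its MemoryHandler flushes buffered records when an error arrives. A minimal sketch of the circular-buffer variant (the capacity and logger names are just placeholders):

    import logging
    import logging.handlers
    from collections import deque

    class RingBufferHandler(logging.handlers.MemoryHandler):
        """Hold the last N unformatted records; dump them only when an ERROR shows up."""
        def __init__(self, capacity, target):
            super().__init__(capacity, flushLevel=logging.ERROR, target=target)
            self.buffer = deque(maxlen=capacity)  # circular: old records silently fall off

        def shouldFlush(self, record):
            # Never flush just because the buffer is full -- only on ERROR or worse.
            return record.levelno >= self.flushLevel

        def flush(self):
            # Hand everything we've buffered to the real handler, then start over.
            for record in list(self.buffer):
                self.target.handle(record)
            self.buffer.clear()

    handler = RingBufferHandler(capacity=1000, target=logging.StreamHandler())
    logging.basicConfig(level=logging.DEBUG, handlers=[handler])

    log = logging.getLogger("payments")
    log.debug("charge attempt %s", 42)  # cheap-ish: args are stored, not formatted
    log.error("charge failed")          # dumps the buffered DEBUG context, then the error

Formatting only happens when the target handler finally emits the buffered records, which is exactly the deferral being described.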
As for the article's thesis that you only need two log levels: that seems wrong to me. If you look at it from the perspective of the target users, there seem to be at least three levels. There is "I've died for this reason". That's CRITICAL. Then there is the user of your code, who is trying to debug its interactions with other systems. That's INFO. Finally there is the person trying to debug the internals. That's DEBUG. In reality there seems to be a fourth useful level: I've got bad inputs or am in a bad state, but I'm continuing anyway. Examples are API calls with bad parameters or a storage system running low on space. That's WARN.
This sounds extremely appealing as I either have to take the performance hit and nose through the entire debug log to get to where the error occurred, or have errors occur and not be able to reconstruct the events that drove them.
Any tips on rolling your own? My first whack would probably be treating it like a packed message, e.g.
with everything aside from the format string having agreed-upon byte counts, and probably have the file names dumped to a lookup table. There's probably something more elegant, but I've never been good at straying too far into the eclectic.
Ferrous Systems made defmt[1] for Rust. The linked design page details the inner workings. One trick is that even the format strings are packed into a lookup table: they're all stored in a special .elf section, and only the index into that table is logged. So it's more like <format string index> <[length], n packed bytes for first variable> <[length], n packed bytes for second variable>… etc. Some types don't need a length; when it's not needed, it's omitted. There's no need to transmit the file, line, or number of args because that's all in the lookup table, indexed just like the format string!
E.g. a slice needs a length.
defmt::error!("Data: {=[u8]}!", [0, 1, 2]);
// on the wire: [1, 3, 0, 1, 2]
//  string index ^  ^  ^^^^^^^ the slice data
//   LEB128(length) ^
LEB128[2] is the same compressed integer encoding scheme used by DWARF & WASM.
The entire "Data: {=[u8]}!" format string is just one byte on the wire (assuming it's among the first couple hundred log statements, since the index is also LEB128 compressed).
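For the curious, unsigned LEB128 is tiny to implement: 7 bits of payload per byte, with the high bit saying "more bytes follow". A quick Python sketch:

    def uleb128_encode(n: int) -> bytes:
        """Unsigned LEB128: 7 data bits per byte, MSB set means another byte follows."""
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)   # more to come
            else:
                out.append(byte)          # final byte
                return bytes(out)

    assert uleb128_encode(3) == b"\x03"         # small values stay a single byte
    assert uleb128_encode(300) == b"\xac\x02"   # 300 = 0b1_0010_1100 -> 0xAC 0x02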
I often enable DEBUG on specific conditions when troubleshooting. It helps if I don't have to invent a complete logging layer because some person had the idea that only INFO should be in the code.
The problem with doing this with logging is that the frameworks usually make it hard or impossible to selectively enable debug logging for just one customer, or just one in a million requests, especially if you want to do it dynamically (which you usually do). These patterns are far more useful than enabling debug logging for everything, because that just creates a spew and slows everything to a crawl. So you usually end up rolling your own sampling that you manage yourself.
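In Python, at least, you can get part of the way there with a logging.Filter that looks at a customer id attached via extra= plus a sample rate. The attribute name and the allow-list here are made up; to make it dynamic you'd feed the allow-list from a runtime config or feature-flag service:

    import logging
    import random

    class SampledDebugFilter(logging.Filter):
        """Pass DEBUG records only for flagged customers or a tiny random sample."""
        def __init__(self, debug_customers, sample_rate=1e-6):
            super().__init__()
            self.debug_customers = debug_customers  # mutate at runtime (flags, admin API, ...)
            self.sample_rate = sample_rate

        def filter(self, record):
            if record.levelno > logging.DEBUG:
                return True
            customer = getattr(record, "customer_id", None)
            return customer in self.debug_customers or random.random() < self.sample_rate

    handler = logging.StreamHandler()
    handler.addFilter(SampledDebugFilter(debug_customers={"cust-42"}))
    logging.basicConfig(level=logging.DEBUG, handlers=[handler])

    log = logging.getLogger(__name__)
    log.debug("cart: %s", ["widget"], extra={"customer_id": "cust-42"})  # emitted
    log.debug("cart: %s", ["gadget"], extra={"customer_id": "cust-7"})   # almost always dropped

The records still get created before the filter drops them, so this controls the spew rather than all of the CPU cost.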
Plenty of times we've had to run scripts by hand on production with the log levels cranked up to determine how something failed the way it did. Can't do that if you haven't already annotated your tools with the logging statements.
Yep, those messages are really handy even if you never enable them in production. Being able to scan through log lines can point you to the source of an issue much faster than stepping through code can. From there you can set a breakpoint and dig deeper.
And while, yes, you can have your debugger stop where exceptions are thrown, that's often not very useful because the exception happens after the point where you want to start seeing what is going on.
A counter example: in nginx I've used debug_connection on more than one occasion on production servers handling hundreds or thousands of requests per second, and it did provide useful hints.
Log channels allow you to route certain messages to certain systems. CRITICAL, for example, can be forwarded to your "you have to deal with this RIGHT NOW" alert system (CRITICAL: Stock trading algo lost 75% of its portfolio value in the last hour). That's very different from a regular (ERROR: Endpoint /report responded with code 503).
DEBUG and TRACE can be combined with dynamic filters to keep the volume of messages down while still being able to get useful information from a production system. There are also dev and staging servers for running general DEBUG and TRACE in a lower volume environment.
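As a concrete (if simplified) illustration of that kind of routing with Python's stdlib logging; the SMTP details are placeholders, and in practice the alert handler would more likely point at your paging system or a syslog facility:

    import logging
    import logging.handlers

    root = logging.getLogger()
    root.setLevel(logging.INFO)

    # Normal channel: everything INFO and above goes to the application log.
    root.addHandler(logging.StreamHandler())

    # Alert channel: only CRITICAL records reach the "deal with this RIGHT NOW" system.
    alert = logging.handlers.SMTPHandler(
        mailhost="smtp.example.com",
        fromaddr="alerts@example.com",
        toaddrs=["oncall@example.com"],
        subject="CRITICAL from trading-algo",
    )
    alert.setLevel(logging.CRITICAL)
    root.addHandler(alert)

    logging.getLogger("trading").critical("portfolio value down 75% in the last hour")  # pages someone
    logging.getLogger("api").error("endpoint /report responded with 503")               # log only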
I agree, this is a very clear fact that's readily understood by anyone with any experience developing and operating a service.
It's weird to see claims arguing that one of the core tools of troubleshooting isn't needed, without any sound argument to back them up.
Among the long list of arguments that can be presented, I'm only going to point out one which might be obscure: cost. Cloud providers charge a premium for data generated through log ingestion, and the more logs you generate the more you pay. Most of the time needlessly so.
One strategy to work around that is only logging error/critical in production. That avoids the invoice caused by all the other log levels, which dominate the whole event dump. However, when the shit hits the fan you will want as much info as you can get to help you troubleshoot problems quickly. With the standard log level hierarchy that's as easy as tweaking an environment variable in prod to output additional log levels. If you want to, you can drive the level all the way down to dumping tracing info.

Modern production-ready logging frameworks even support filtering log events at the namespace/package/class level, for this very reason. You pepper your code paths with the relevant log messages at the adequate log levels, and once you have your infrastructure sorted out you have all the tools you need to adapt your logs to your operational needs.
You absolutely cannot do that with simplistic info/error logs.
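A rough sketch of the environment-variable approach with Python's stdlib logging (the variable names are invented; dedicated frameworks give you the same thing with less string parsing):

    import logging
    import os

    # Errors-only by default in production; flip LOG_LEVEL during an incident, no redeploy.
    level = os.environ.get("LOG_LEVEL", "ERROR").upper()
    logging.basicConfig(level=getattr(logging, level, logging.ERROR))

    # Per-namespace overrides, e.g. LOG_LEVEL_OVERRIDES="myapp.payments=DEBUG,urllib3=WARNING"
    for item in os.environ.get("LOG_LEVEL_OVERRIDES", "").split(","):
        if "=" in item:
            name, lvl = item.split("=", 1)
            logging.getLogger(name.strip()).setLevel(lvl.strip().upper())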
Why do you need logs for that though? You could also use metrics and define the alert routing based on thresholds and alerting rules there. It also has the advantage that you can alert before something bad happens by looking at how a time series develops (like, for example, disk space).
> Why do you need logs for that though? You could also use metrics and define the alert routing based on thresholds and alerting rules there.
No, you cannot. At least for most of the practical cases. Some upstream dependencies can and do throw critical/fault-level log messages which you do not control and cannot catch.
This scenario is frequent enough that some cloud providers even provide services that generate metrics streams by filtering logs.
Also important: the operational side of running a service. When the shit hits the fan, you do not have the luxury of redeploying production services just because you need to add a log call or emit a metric. For that, you work with what you already have in place. I was involved in a couple of SEVs where I personally created metrics filters and alarms on the fly to help troubleshoot issues and monitor relevant failure modes.
In terms of using stderr/stdout: yeah I think this is totally defensible. And depending on how hard you squint, and if you use Plan9, logging to a separate log file is just a metrics stream too, and it very well could be transformed into one before being sent anywhere else. This is overwhelmingly normal too, so it's not a hypothetical extreme at all - avoiding std* is often a good practice.
Semi-structured streams with basically unbounded cardinality are still useful tho, and I think that's arguably "a log" regardless of how it's represented. And then levels come back into the system.
You write alerts based on a few high level metrics, and then define where these alerts are routed and whether they are critical vs. bad-but-don't-need-to-page-someone.
I would argue that the only three I care about in production software are INFO, ERROR (an issue occurred but the system is still running) and CRITICAL (an error occurred that resulted in a crash).
Any others are only really useful for debug situations.
In addition to the issues other comments raise, WARNING is useful separately because while INFO says "something happened", WARNING says "something weird/unlikely happened". I often use it for edge cases I don't think are possible but haven't been able to rule out, among other useful debugging strategies. Warnings may not require action like errors, but they should still stand out.
When I don't have LogLevel.Warning or the like, I end up doing things like adding a "WARNING:" prefix to the message so it stands out and can be grepped, but it's much better when I can just use the LogLevel to filter, color the message, show it in the UI, etc.
For me the difference between WARN and ERROR is in how likely it is that you need to take action when the event happens:
WARN - something is off the happy path but it's either out of your control (an external system sent a broken request), or not urgent (e.g. use of a deprecated API), or should be handled gracefully (a timeout for a sub-request with multiple retries; the request may still succeed, just with a high response time).
ERROR - you need to take action, or at least may need to - e.g. an indication of a bug or a problem in the infrastructure (cannot connect to a database, disk is full, a process has run out of file descriptors, an expired certificate, an invalid auth token, etc.).
Of course one can lump these two levels together, but it makes Ops work harder.
> And similar for things like DEBUG or TRACE. Usually you reach for those log levels when you want to see extremely verbose information, but... That's not practical to enable in production environments, because the volume of logs you produce would be way too high. And if you can't use it in production, you probably shouldn't use it elsewhere! That's what your local debugger is for.
Oh boy, where to start....
1. Staging/pre-prod environments do exist.
2. Debuggers are nice, but they're not everything. I have a loop and want to see the distribution of execution times. Should I really use a debugger, writing down the info from each iteration? Or maybe something a bit more automated (like a log... see the sketch after this list)?
3. Being able to enable debug logs (for a specific part of the system) is extremely helpful. Datadog makes it super easy to enable finer logging at runtime.
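For point 2, a sketch of what that looks like with plain DEBUG logging (process() is just a stand-in for whatever the loop body does):

    import logging
    import time

    logging.basicConfig(level=logging.DEBUG)
    log = logging.getLogger("batch")

    def process(item):
        time.sleep(0.01)  # stand-in for real work

    durations = []
    for item in range(100):
        start = time.perf_counter()
        process(item)
        elapsed = time.perf_counter() - start
        durations.append(elapsed)
        log.debug("item %d took %.1f ms", item, elapsed * 1000)

    log.info("100 items: min %.1f ms / max %.1f ms / mean %.1f ms",
             min(durations) * 1000, max(durations) * 1000,
             sum(durations) * 1000 / len(durations))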
> [DEBUG or TRACE levels are] not practical to enable in production environments, because the volume of logs you produce would be way too high.
Let's not forget, not all production environments are Google-scale! I've worked on mission critical applications with a few hundred internal users, tops. Turning on the debug logging firehose can be feasible for short periods of time. Even more so if you can limit it to a targeted subset of users.
As a counterpoint to what appears to be the main points being made in this discussion:
On a team I was technical lead for a number of years back, we decided to only log errors, and only if there was an identified, known error path. We also documented each error and what action, if any, was expected to be taken.
The team was relatively junior and, up to that point, had a habit of scattering debug and info statements through any code touched by the feature under development.
This forced developers to diagnose development issues using better test coverage and to resolve and deal with error paths early.
When we went to production we had alerts that informed us anytime the log size > 0 kB and we investigated and resolved the issue as soon as we could.
It wasn't the most complicated application (an online government form for applying for a pension), but it did integrate with a number of other systems and deal with a fairly varied set of scenarios. We also had fewer lines of code and better codebase readability.
No idea if that policy is still in place but it did have a number of benefits.
I've seldom added anything but error or observability logging to anything I've worked on since.
One of the best things I did as one of the earliest employees at $company was writing helpers to set up Python logging and command line argument parsing, so -v/-q flags controlled what level got emitted to STDERR, and other CLI flags configured where and at what level logs got emitted. You didn't use print, you just logged messages. Devs could stick in their informative messages at INFO; warnings, errors, and critical errors at their levels; and debugging and tracing right down to a DEBUG2 level, and ops could actually control how things worked in the software they needed to deploy. Consistently, without worrying about each and every special snowflake (at least as far as the in-house Python was concerned).

Timestamps even on INFO turn out to be really useful. Weird problem? Devs could get ops to crank up the logging on staging or production to help diagnose it. Everybody used it without being pushed, for at least 16 years (and it's still the preferred approach AFAIK).
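Roughly the shape of those helpers, sketched from memory rather than the actual in-house code (the DEBUG2 value and flag names are illustrative):

    import argparse
    import logging
    import sys

    DEBUG2 = 5  # extra-verbose level below DEBUG
    logging.addLevelName(DEBUG2, "DEBUG2")

    def setup_logging(argv=None):
        parser = argparse.ArgumentParser()
        parser.add_argument("-v", "--verbose", action="count", default=0)
        parser.add_argument("-q", "--quiet", action="count", default=0)
        parser.add_argument("--log-file", help="also write logs to this path")
        args = parser.parse_args(argv)

        # Start at WARNING; each -v steps one level more verbose, each -q one level quieter.
        level = logging.WARNING + 10 * args.quiet - 10 * args.verbose
        handlers = [logging.StreamHandler(sys.stderr)]
        if args.log_file:
            handlers.append(logging.FileHandler(args.log_file))
        logging.basicConfig(
            level=max(level, DEBUG2),
            format="%(asctime)s %(levelname)s %(name)s: %(message)s",
            handlers=handlers,
        )
        return args

    args = setup_logging()
    logging.getLogger("tool").info("starting up")  # hidden by default, shown with -v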
Log levels are a distraction. Your app shouldn't be deciding what's relevant, you should be emitting everything with tracing[0] and then writing alerts for fact patterns you care about.
[0] Not the trace log level, but events and spans, i.e. OpenTracing
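OpenTracing has since been folded into OpenTelemetry; a minimal sketch of the spans-and-events style using the OpenTelemetry Python API (the span and attribute names are made up):

    # pip install opentelemetry-api opentelemetry-sdk
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer(__name__)

    def handle_checkout(cart):
        with tracer.start_as_current_span("checkout") as span:
            span.set_attribute("cart.size", len(cart))      # emit everything...
            with tracer.start_as_current_span("charge_card") as charge:
                charge.add_event("charge.attempted", {"amount_cents": 1999})

    handle_checkout(["widget"])  # ...and let the backend's alert rules decide what matters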
Sadly, my experience has been that this is way too expensive, so you have to decide.
These days, something like 90% of the observability issues I have to deal with are cost related. "You can't have that much cardinality.", "Do you really need this metric?", "We've decided to turn off all logs in dev envs, they are too expensive.", ...
I am not sure why the author thinks that the DEBUG level is about debugging your code before putting it in production. The real use of DEBUG is in production, when one day the roof is caving in and the normal logs don't have enough information for you to figure out why.
Especially when the deployment is not under your control. That is when you ask for DEBUG to be enabled so the system can spew more information and you can trace the execution better.
OTOH, this whole idea of attaching a trace-id (Android) and recording where the log came from (modules and functions) is nothing new. On the whole, I am not sure what the takeaway is.
So where do deprecation warnings go? I'm actually sympathetic to logs not being the best place, but I'm not sure what else to do with them. If you've got a backend/admin interface then sure, pop it up there, but not all applications have such a UI.
A significant part of my job (sysadmin) has always involved running 3rd-party daemons on the company's servers to provide assorted services. It's semi-common for these services to throw deprecation warnings on startup like "you ran this service as fooservice --db-backend=sqlite --db-path=/var/lib/fooservice/db.sqlite; please be aware that sqlite support will be removed in version 9.0 next year". In a perfect world we'd catch that kind of thing by reading release notes, but 1. release notes don't always get read, 2. release notes don't always spell these things out as well as they should, and 3. release notes don't know what options you're actually using, so it can be way easier for the program to throw a warning when it sees you actually using deprecated functionality.
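One pragmatic option, at least for Python services: emit them through the normal warnings machinery and let logging capture them, so they land in the same log stream ops is already watching (the option name below is made up):

    import logging
    import warnings

    logging.basicConfig(level=logging.INFO)
    logging.captureWarnings(True)     # route warnings.warn(...) to the 'py.warnings' logger
    warnings.simplefilter("default")  # make DeprecationWarning visible (hidden by default)

    def configure(db_backend="sqlite"):
        if db_backend == "sqlite":
            warnings.warn(
                "sqlite support is deprecated and will be removed in 9.0",
                DeprecationWarning,
                stacklevel=2,
            )

    configure()  # shows up at startup as a WARNING on the 'py.warnings' logger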
Comes down to speed requirements for me. If I’m coding in python and deploying it to an anaemic cloud function it’s going to be slow AF anyway. May as well throw in some DEBUG too.
Plus, three levels makes more mental-model sense to me: debug noise, key info, and something's on fire.
The idea that a debugger can be used instead of DEBUG log messages is somewhat laughable. Even in js client-side situations, a bunch of console.log calls can be a lot more informative than triggering the debugger.