Hacker News new | past | comments | ask | show | jobs | submit login
You say “cave dweller debugging”, I say debug logging (sicpers.info)
115 points by ingve 52 days ago | hide | past | favorite | 135 comments



I have become less and less enamored of gdb (or IDE equivalent) debugging over my many years in software. Debugging is often very confusing with multithreaded code. It only works in one language at a time. If you're doing Java with JNI for example, the code called through JNI is a black box. If you are doing anything involving multiple processes, or distribution, debuggers are useless.

But the real problem, and top-level issue, is that you are constantly rooting around the very lowest level of your code, and it is difficult and labor-intensive to get a higher level view of what's going on.

On the other hand, well-designed logging code is there when you need it, can provide a view at any level of abstraction that you'd like (if you've written the logging code), and allows you to go forward and backward in time easily and repeatedly.

The project that really pushed me in this direction was working on a cluster. My piece of the system would start on one of the nodes, and delegate work to a fixed number of other nodes, or to all nodes. Out of necessity, I put a lot of work into debugging code, taking care to include the right information in each line (node id, time to nsec, source id, line number), carefully formatted to enable sorting and filtering, log rotation to make sure that I could focus on recent events or a longer timescale, and a lot of attention to the information included in each line and how it was formatted.

Because this was a long-running distributed system, I would often need to gather logs from across the cluster, going back hours or days, and merge and sort the files, and then debugging could start.

Going along with logging was heavy use of assertions, which were enabled even in production. I definitely did not want my code going forward doing random unanticipated crazy things past the point of an assertion failure.


If you're at the point where you need to attach a debugger, in production, you have already lost. I don't care how much you like debuggers and debugging. Anything, but the most trivial, single-threaded code will be problematic to debug.

So? What do you do? Logging at the right level + turn up logging. Metrics - proper metrics. Core dumps. Debugging turns into an exercise of grepping/inspecting the logs/metrics and/or working with the dump. Anyone that tells you differently is massaging the truth.


I agree. In today's world of cloud services, good logging is the single most important way to debug issues.


At the level of single process, on one machine, I've found a mix of debug logs, trace points, conditional breakpoints, and the ability to execute arbitrary code when at a breakpoint to be super helpful in debugging complex code with many layers. I was working on a compiler of sorts. The only time I had access to all of these was when working on C# in Visual Studio. Being able to run arbitrary code at the breakpoint-paused stack is incredibly useful.

With multiple processes or machines, it's helpful to pass around a trace ID and then have all the debug logs print the trace ID + timestamp at each event. This is a rather manual process; I wish there were automatic ways of doing this. At Google, the best we get is filtering logs by the RPC being processed at the time; but this ends at a particular process. Cross-process RPC traces are mostly sampled "stack traces"; you don't get to inject custom debug messages in those.


On the Microsoft world you get that automated via ETW and in case of running on Azure, Application Insights.

With debug symbols you can even pinpoint those events into code.

The only competition I know to these workflows is using Java alongside Flight Recorder.


Working with gdb is quite the learning curve. Luckily, it’s mostly unnecessary, except in the rare case where you actually need register-level debugging or hardware watchpoints. But it can still be done if you’re willing to suffer through the docs.

For debugging multithreaded code, you can run a gdb script in batch/non-interactive mode, and direct the gdb output to some out-of-band channel, like a file or other terminal session.

For debugging JNI, you can try running Java under gdb and set a breakpoint on the extension library entry function before it’s loaded.


... or you could just log.

I don't understand the extreme attachment to interactive debugging that I see in some of the comments here. It's like people are so attached to this idea that they will do anything to make it work as each complication is added.

Furthermore, the amount of attention and manual labor needed to use a debugger effectively is just staggering. One mistake, e.g. you step over a function you should have stepped into, and you've got to start over.

Or you could just log.

Yes, writing logging code is no fun. But you don't even have to do it all at once. In fact, you probably shouldn't. I often found that I didn't know what to log until I had to investigate some problems turned up in integration testing. So I added logging to solve the problem, and then benefitted from it forever.


The lure of gdb is that you do not need to write the logging code.

But I am reminded of the diddy: "If a system has a Powerful Debugger, what does that say about the apparent need for a Powerful Debugger?" --- I read that as: if you have a kick-a* debugger, you have a really complicated system. Maybe the answer is to make the system less complicated.


That doesn't make sense to me. Complication is mainly a property of the software you're writing, not the language in which the software is written.


It's true that at the end of the day unnecessary complexity is largely a matter of programmer discipline or lack thereof. However, some languages give the programmer considerably more opportunities for unnecessary complexity than others. And some language/runtime combinations make writing correct software impossible. For example back in the day Perl wasn't properly reentrant, so any signal could possibly lead to undefined behavior. There were ways to mitigate the problem, but not to eliminate it.


Yes.

Extracting simplicity from complexity is hard, and that is why most systems are unnecessarily complex. And I mean: 'unnecessarily' in that there exists simpler systems that perform the same function with less complexity.

I'm late-stage career here, so I've seen (most of) it all. There seems to be additional complexity piled on 'just because' so often. Occasionally, I see a system and I can't think of a simpler way to do it. That's rare. I bet it's rare for you, too.


> One mistake, e.g. you step over a function you should have stepped into, and you've got to start over.

Not with rr (rr-project.org). With rr you can step right back where you were. rr also lets you debug after the fact without interrupting your program, and lets you collect a recording on a cluster machine and then debug the recording elsewhere.


Debugging JNI is super easy in Eclipse and Netbeans.


> I have become less and less enamored of gdb (or IDE equivalent) debugging over my many years in software. Debugging is often very confusing with multithreaded code. It only works in one language at a time. If you're doing Java with JNI for example, the code called through JNI is a black box. If you are doing anything involving multiple processes, or distribution, debuggers are useless.

These are true of plain-old-gdb but they're not inherent to interactive debugging. rr and Pernosco address most of these issues and with more work we could do even better.

In particular: rr is great for debugging multithreaded code; record a race once and then debug the replay as much as you want until you've figured it out. rr will also record multiprocess code and let you debug whichever process you want; Pernosco goes further and supports seamless debugging across all processes. https://pernos.co/about/javascript/ shows how Pernosco can support debugging high-level code (JS) and C++ code at the same time. rr can record processes on multiple machines; we don't have an integrated debugging experience for distributed systems yet, but it's not hard to see how that could be done.

> But the real problem, and top-level issue, is that you are constantly rooting around the very lowest level of your code, and it is difficult and labor-intensive to get a higher level view of what's going on.

That depends on the debugger. Interactive debuggers can do higher-level things if we want them to, e.g https://pernos.co/about/expressions/


The problem is that in order to add a missing log in production you need to effectively redeploy. That isn't tenable for a polyglot production cluster. There's a rising field of developer observability which is specifically designed to provide polyglot cloud "debugging" capabilities. One of the capabilities is to inject new logs dynamically.


And if you need to fix a bug in production you need to redeploy too.

Bugs and logs and testing are all related. You obviously (should) do unit testing, integration testing, system testing before production. As you do these things, you will find and fix bugs. Ideally, you would add logging to help you diagnose the bugs you are fixing. Congratulations, you now have logging in code known to have been buggy, which will serve you well in production if bugs not caught by your tests turn up.

But bugs in previously reliable code do slip through, and this gives rise to the need for previously unanticipated logging. If either of these things happens a lot, the problem is your testing leading up to production. In other words, if you often find yourself needing dynamically deployed logging, then the lack of this feature is not the problem you should be paying attention to.

Also, dynamically deployed logging is basically shipping untested code, no?


The difference between deploying once and deploying just to add more logs. And more logs is a huge difference. I don't know about your environment but I can't push something to production on my own schedule. I need approval, process, etc.

This is painful to do just to add a log which might not include all the information I need.


The more constraints you have on pushing something to production, the less it makes sense to deploy logging dynamically! Those constraints are there to minimize risk, ensure quality and security, etc. Dynamically deployed logging seems to provide an extremely large loophole. For example, consider the recent Java logging fiasco.


Excellent point. The tool we use (Lightun) runs all of that in a read-only sandbox and limits the classes you can access (they are in a sandbox too) so you can't print "anything". It throttles your access so you won't use too much CPU with expensive operations too.

It also comes with PII reduction and block lists so you won't just log user passwords in the login code...

This solves the problem of over-logging: https://www.reddit.com/r/devops/comments/udgohy/there_is_no_...


Not necessarily. I've been at some places where you just "do it live." We had dozens of worker nodes. An engineer would "reserve" one for debugging purposes, editing python code live, adding logs or prints as needed and restarting services. Worst case a small percentage of requests was lost, though generally they would be retried except in the most extreme cases. This risk was considered acceptable.

Obviously, depending on the business and maturity level of the company, this could be a big no-no. In general, I find those "wild west" environments more fun. Waiting for an hour long build, deploys, and "approvals" to add a couple of prints is a big drag.


For that you have developer observability which doesn't carry the HUGE gaping security risks or performance problems. But still lets you debug the cluster at scale.


I’ve worked in that space. Observability tools may not give you all you need to debug certain classes of problems. I saw your other comments about Lightrun and will check it out though!


I've had luck with debugging locally, in a single-process context, not expecting to reproduce the problem when unfamiliar with the codebase to get a good understanding of what's happening when in the code.

That tells me where I need to add metrics, logging, etc for the distributed deploy.


Add thread ids to your logs


I would be so screwed without this 10 times a month in one messaging app I've developed to integrate our archaic pl/sql core banking system with 21st century reality.

Luckily its not performance-oriented but more reliability-oriented so I can get away with more logging than usually necessary. It seems obsolete and over-the-top till it saves everyone's a*


Sometimes the debug log is real product.


Context IDs are often even better since the physical thread executing some high-level operation might change (i.e. if you use a thread pool).


> If you are doing anything involving multiple processes, or distribution, debuggers are useless

How so? Can't you just attach to each process? Do you mean at production scale?


This reminded me of a hail-mary use of debuggers in the late 1990s...

Debugging a show-stopper synchronization flaw in the bootstrapping of a parallel job, I had to tell the job system to launch each of 64 nodes wrapped in gdb wrapped in xterm with remote X display back to a laptop. It was something that had "worked in test" reliably, but that was always on a smaller number of nodes or simulating a larger number with time-sharing. It seemed to need real parallel hardware allocations to show up.

So, I was able to launch the job, allow all nodes to run until distributed deadlock, then interrupt and show all thread stacks on one debugger after another until I found the odd process which was out of phase with the rest. As soon as I saw the stacks, I had no further need for the debuggers. Just knowing the "impossible" state configuration happened was the necessary clue.

I have seen lots of satisfactory use of logs for diagnostics, but this is a case where I think the debuggers were almost essential. The debuggers gave a distributed state snapshot due to the deadlock-induced quiescence of the whole system. To have logs show the same snapshot would require a perfect arrangement of unbuffered logging so that the final state would be visible and not caught up in RAM of some or all of the stalled processes.


Sounds like all you really needed something like a signal handler that would dump the stack traces of all the threads.

I've used that a lot black-box debugging java processes in production.

I also wrote such a signal handler for a ruby app.


In a distributed parallel computing scenario, dumping stacks from 64 processes via a signal handler would have created 64 different files scattered in 64 different remote filesystems. Ironically, the purpose of our library was to bootstrap a distributed communication network and so could have easily supported gathering of remote diagnostic data _if it were not the part that deadlocked_...

But worse, we were debugging our library linked into someone else's application. So, we would have needed their cooperation to add signal handlers. And, we naively thought our library had already been sufficiently tested and did not anticipate the need for last-minute debugging when our user moved their application to a different computing resource that was not available during prior months of preparation and prototypes.

So, the ugly app-in-gdb-in-xterm with X over Internet addressed all that with adhoc instrumentation and communications that could be folded into the existing parallel, distributed job on short notice.

This was also in the days of pthreads in C. The other valuable use of gdb I recall was for memory watchpoints to help track down unexpected changes to certain data. The other tool we got a lot of use out of in those days was Purify to help audit for use of uninitialized memory and for memory leaks.


> In a distributed parallel computing scenario, dumping stacks from 64 processes via a signal handler would have created 64 different files scattered in 64 different remote filesystems.

I was working on a cluster and often needed to do the same thing on all the nodes, run some database query or os command, and gather and integrate the results. I wrote yet another Python stream-objects-instead-of-strings shell (https://geophile.com/osh), which included features for distribution: Run the same command on all nodes, bring back the results, merge streams, gather files, distribute files, etc.

That was several years ago, and it wasn't a full shell. Since then, I've done an improved system that actually is a shell (https://marceltheshell.org, https://geophile.com/marcel).


One thing I have used is drop into the Python debugger (also ipdb) from within a signal handler. Then I can run a kill, and have it drop into the debugger.

There are some of cute hacks to get a remote debugger which I haven't used in years.

https://github.com/sassoftware/epdb can start and connect to a remote debug instance

https://github.com/gotcha/ipdb


Even in development.

You could, I suppose, but I'm not sure why you would want to. Going back and forth to two debuggers (at least) seems like torture. And with multiple processes, timing and synchronization issues could make interactive debugging a real nightmare. Why? Why would you do this?

Interactive debugging doesn't scale in any dimension.


I used intellij to debug 3 processes over ssh, it worked seamlessly. The problem with logging is that you have to know where to log. Debuggers are excellent for exploration, where logging implies you already put the code. If you forget some log you need to do the whole build/deploy dance, with a debugger I just restart. There are even time travel debuggers that allow replays.


> The problem with logging is that you have to know where to log.

I start with basic lifecycle logging, at an INFO level: The FooBar is created, the FooBar is destroyed, and major states in between. If a FooBar manages a set of things, then I might also add DEBUG level logging for operations on those things. But mostly, detailed logging comes later, as I investigate problems. You really can't anticipate too much logging, because you don't know what's going to need it.


It's really not that hard, especially if you can run them all from the same IDE. I can see how it starts to get out hand but it's not at n=2.


Early in my software career, I was debugging a compiler problem. Version X worked, version X+1 did not. I debugged by tracing both step by step, in two gdb sessions, looking for when behavior diverged. Pure torture, and that was an extremely simple case -- the processes were not communicating in any way.

All the usual problems, but doubled. Oops, I stepped too far, start over. Oops, the other one stepped too far, start over. Oops, start over, oops, start over.

Never again.


I'm having the same issues...

Although I'm not a full time developer, I do some small coding and I'm also sort of responsible for the dev team. I rarely use a debugger. I do most of debugging through logging. There are 2 advantages that I see:

   - it forces me to add the logs where they are needed to understand the flow of the application
   - it forces me to make those logs actually usable, readable and understandable
Developer's first response on a bug report is - give me logs. But I don't have them, because they were not coded. Or they are unusable because they are not logging the actual problem. Developer then insists the problem is fine because he can't replicate it in his debug environment instead of figuring out the issue from the logs. Making logs during development also forces developers to make a conscious effort of finding a balance of what and when to log something, create proper debug levels not to overwhelm the logs, while still providing some useful information...


One thing that it takes some time to understand (and to teach to junior devs) is not only to add logs, but also useful logs.

The stages of understanding go like this, in my experience:

1. Log nothing.

2. Upon being told to log things, write mostly useless logs: "Entering f() function", "leaving f() function". When f() fails, who knows what happened?

3. Upon being taught to add context so that logs can actually be used to troubleshoot problems, "entering f() with id 123", "id is 123", "writing 123", etc.

4. Enlightenment usually comes when logs include necessary info but are not redundant, are not noisy, and can actually be used to track down problems. Of course, improvement at step 4 is always ongoing, for all of us.


In my experience, (4) happens when the junior SDE is first on call responding to a problem happening with code they wrote.

They experience the problem with their own logging and realize through debugging their code what information they really wanted all along.


5. Get a ticket in your backlog to reduce logging because you've blown through your annual Datadog budget and it's only May.


> Making logs during development also forces developers to make a conscious effort of finding a balance of what and when to log something, create proper debug levels not to overwhelm the logs, while still providing some useful information...

The need for this balance is why I don't completely agree with the article.

If I'm debugging something with log messages, I put way more detail into it than I want to see in production even at LogLevel.DEBUG.


Plus, I'm adding more and more as I box in the problem. If I left that in I'd have an extreme level of detail around an aspect of the program that's very unlikely to have a bug because it's been thoroughly analyzed and fixed while being blind to everything else (except the parts that have received similar attention).

Maybe I'm just bad at it.


If I'm debugging something with log messages, I put way more detail into it than I want to see in production even at LogLevel.DEBUG.

I mean, there is LogLevel.TRACE when you need it. For those not familiar, this level is exactly for “I’m writing new buggy code where anything can go wrong so stupidest details matter”-time. You can use it to separate normal debug logging like “request body is {…}” from sandbox logging like “loop counter just became [object Object]”.


I don't know that there's an easy solution to this problem, but what I've seen Kubernetes and other Google-derived projects do is replace standard log levels with an increasing series of integers so you can log in increasing detail. This seems a little arbitrary (one developer's 8 might be another developer's 14 or something) but at least you can easily ask for increasingly more detail until you get what you need without noise-by-default.


The increased detail also clutters the source, though.

As I write this I realise that the "right" solution is that my IDE also folds away log calls of below some settable priority... Then I need a linter to prevent any code with side effects from running in a log line... Then I need to work in a language where that kind of static analysis is definitely possible... Maybe it's not so simple.


Create a logging method/function, llog() or something, with named parameters and whatnot.


This just creates a festering problem because devs are given too much freedom. A better solution is just two levels, basic and verbose, with subsystem feature flags you can control to prevent verbose from being a fire hose.


That sounds promising. Ever done it this way? Howd it turn out?


No doubt there is a place (or indeed a need) for debug (or trace) logging in environments where you can’t simply pause execution.

However, in development environments you would be foolish not to use the (vastly superior) tooling available.

I think many developers shy away from debuggers simply because they haven't taken the time to understand how to use them effectively, whereas everyone knows how to write a log statement. In previous roles I have given guidance on usage to fellow team members. It makes a vast difference on the resolution time of reported issues.


> However, in development environments you would be foolish not to use the (vastly superior) tooling available.

The debugger being vastly superior is highly dependant on what environment you are running in.


I have had cases where it takes ~2 hours to hit a bug case. You cannot afford that time waiting for the debugger. You must achieve that with logging. You might be able to figure out, with that logging, how to hit the bug faster and then you can use the debugger to step through, but I have found the use of a debugger to be beneficial in fewer than 20% of bugs I am trying to solve.


That sounds like exactly the sort of situation where you really want to use a debugger! I'd rather wait 3 hours to hit the bug and have full access to everything than wait 2 hours and realise that I didn't log an important variable and have to start again.


Debugger usage has been non-monotonic with time for me. When I first started coding I would print. Then I discovered debuggers and experienced a massive improvement in productivity on bug fixing and problem solving tasks. But over time I found myself reaching for the debugger less and less often as I became better at solving problems. In the last 5 years I have to admit I've barely used debuggers at all.


There are two situations that I encounter that make step through debugging difficult or impossible, and for a lot of my work both apply.

A) When developing for a microcontroller (however large or small). Stepping debuggers are limited, and sometimes unavailable at all. Most are clunky. ARM is the best here.

B) Running real time things. E.g. you can't step through a Bluetooth comms process because the moment you hit a breakpoint you aren't servicing a link and everything falls over.

There are some good tools like reverse time debuggers (e.g. https://undo.io/) that deal with the real time thing, but nothing deals with the combination. Plus a well maintained set of log outputs often tells you exactly what went wrong with just a cursory glance through the logs.


Oh yeah, most definitely - tooling isn't just debuggers, after all. I often use network traffic monitors, packet analysers, etc. No doubt in the microcontroller and EE world theres specific tooling too. That includes the humble logger :)


I will add

C) debugging signal handlers is somewhere between tricky and impossible depending on the scenario because of how debuggers work.

D) debugging optimized code. -O2 or -O3 code will have everything executing out of order, all variables are optimizes out, everything is inlined, etc.


I've used lots of debuggers and debugging tools, have a thorough understanding of their capabilities and have written and extended such tools myself. Print statement debugging will always be a tool in my arsenal. Sometimes - surprisingly often - it's just the fastest way to figure something out.


Hm. If its a timing issue, regular debuggers affect that and will perturb the experiment.

Fast, lightweight logging can be a necessary tool.


I can think of a single case where the debugger changed timing enough to make a problem I was working on not show up. I can think of dozens of times when logging did, though. I guess it’s dependent on the type of problems you work on.


Lots of ways to do logging. Create a circular buffer of tags and values at critical points - takes just a couple machine cycles to add an entry. When the problem occurs (break in the debugger) the table records the last N significant actions.


Debuggers are great (even when I have to resort to running GDB over SSH with no GUI help) but sometimes you have to debug compiler-optimized code, or asynchronous or long-term problems, and then logging can be a quick and effective alternative.


Sometimes you have to debug distributed solutions, and like you say you may have to do it in an optimized environment. You may also have to do it in an environment for which you have no other access than console output. I've been working in these kinds of environments for so long I've practically forgotten how to use a debugger!


I know your list is not intended to be exhaustive, but it's missing my main reason for using print: because the debugger can be really slow (at least in Python).

For some operations, running them in debug mode means they are 10 times slower, turning a 30sec operation (such as parsing big files into memory) into 5 minutes. If I'm relatively certain of what the problem is and I solve it in less than ~5 tries, I saved time.


Python, AFAIK, doesnt have a 'debug' mode, so its likely the IDE that is slowing down your debugging experience. Im not sure how you work around that - some IDEs do a lot of superfluous stuff when 'debugging', and may have settings to toggle features.

You can also just import the debugger where you need to use it, in which case you shouldn't get any slowdown.


> in development environments you would be foolish not to use the (vastly superior) tooling available.

I don't know. Learning and using a debugger adds an additional layer of cognitive load. Instead of thinking about the bug, you're thinking about using the debugger and trying to think about the bug. I have found their value to be limited with the code I normally write.


Sometimes an interactive debugger gets you lost in the trees so you can't see the forest. Not always the best option.


Sometimes threading issues are hard to debug, because attaching the debugger will impact the code behavior. Logging is a tool in the toolbox, I wouldn't dismiss it


If you can show me a debugger that will work on a PHP application in the Symfony framework, runs on Mac, and is not built on Eclipse or a similar lumbering monstrosity, I will gladly give it a try. (Hell, I tried PHPStorm, but it's just so painful to use...)

Until and unless such a thing appears, debug logging is pretty much what I've got.

(And given the complexity of the project in question, and the fact that I'm basically its only developer, debug logging is still pretty vital for running down the bugs the users run into in production. Good thing it's just a free hobby project!)


I dont personally use PHP so I cant really vouch for it, but I just asked a friend and he uses Xdebug in vscode (on an M1 mac).


Hm. I'll give that a try and see if it's worthwhile. Thanks.


Was that a conscious invocation of Cunningham's law? :)


IMO the "vastly superior" comment here hurts the credo of your argument.


There is a time and place to step through the code with a debugger, and there are also situations where a log file can show the "big picture" and display information at a higher level. In the latter case, a log can show in minutes what would take hours to ascertain if stepping through line-by-line.

My app formats the trace log as HTML tables, and uses color to signal errors or invalid cases. Makes debugging fun.

Also, if diagnosing a problem on a customer's inaccessible machine, a log is indispensable.

Tip: close & re-open the file for each write, so the last and most important tail end of the log doesn't vanish if the app crashes.


> Tip: close & re-open the file for each write, so the last and most important tail end of the log doesn't vanish if the app crashes.

It sounds like you're looking for fsync[0]

[0] https://man7.org/linux/man-pages/man2/fdatasync.2.html


I think it's more important to fflush after logging data, so userspace buffers don't get lost if the app crashes. f[data]sync is only necessary to protect against system crashes.


> Tip: close & re-open the file for each write

That seems very expensive. Why not just use line buffering or even flush from a signal handler?


Well... we wrote this when we started with Windows 3.1. Besides it's only used in special circumstances when we're tracking down a bug.


For my own code, 90% of the bugs are caught using logging and assert. The remaining are caught via debuggers. For working on unfamiliar code, using a debugger to step through the logic helps to understand the flow to find the problem.


And the remaining remaining bugs are not caught.


Caveman debugging is the method that overall has solved the most problems in the least amount of time for me. Some situations obviously request completely different tooling.


Couple of anecdata to cool off fans of logging.

Imagine you develop in multiple environments and in prod costs of keeping logs can only let you have few percent of it. Guess what would be there? Right, most spammy message that no one of team leads would recognize as useful. Checking what actually is in logs in prod env is a hard work and usually no volunteers.

Logs can mask problems especially concurrency related. I dont want to give examples here to protect innocent but i can say in 20 years i have seen cases.

Logs can be huge pain in the neck of your performance analysis team. Especially if they have to use instrumented environment for their work instead of prod. Even "disabled" things that don't produce output might still be doing enough work to make any prod and qc environment results completely not comparable.

Logs can be pain in the neck of security and risk teams if you happen to put your code into untrusted hands. Logs often used for reverse engineering your smart solutions with ease. My particular experience with game clients is especially bad in this regard.

PS. My personal best logs in development let you track single element of data from a whole lot of others. It takes time and lots of work to design such logging in complex systems. I have not met many systems done like that. People tend to just spam everything and hope for the best with text filtering. Then the stories of crashing text editors while opening giga sized logs are born xD.

just my 2c.


If checking your logs is hard work you should make it easier. I just click on my Datadog bookmark and select some filters to check logs.


I find debuggers almost entirely useless for two main reasons: (1) 99% of debugging is working out where the error is and (2) the answer to the question "how did this come to be?" usually has an answer involving several bits of code or data far away from each other.


Don't data breakpoints help with both of these things?


Only if you already know the value(s) being watched. If you know that you can also add trace level debug statements on those items in the code.


You can, but it's usually easier to set a data breakpoint than to find every place where the data is modified, add sufficiently verbose logging statements there + upstream of there, rebuild, and search through logs.


When you have logs you can see where the anomalies are and trace them back to the source. Things are wrong all over the place, but where did it start?

Often I'll identify a wrong value, or bad object, and then search for that through the logs to see what is touching it or where it first appeared. If it's interacting with another value I'll search for that, maybe in a longer chain. Eventually you figure out where the interaction is going wrong, and more often than not it's a subtle error that looks right but isn't. You make a small change (the classic is an off-by-one error) and suddenly everything works perfectly.


I use a combination of both. Logging usually requires less setup overhead, so I often opt for that. Sometimes though, the path it takes for the code to reach the part I'm interested in can be pretty obscure. It's in these cases that the debugger truly shines.

It also depends on how many things you are interested in. If you care about, say, a complex object with many properties, then the interactivity of a debugger trumps logging.


> Sometimes though, the path it takes for the code to reach the part I'm interested in can be pretty obscure. It's in these cases that the debugger truly shines.

I feel that it is the opposite. It is where the path is obscure that logging works really well. It can be iteratively refined and repeated.


I spent over 15 years of the start of my career at startups, and was typically the one woken up by alerts and/or the operations team (when there was one) to help decode alerts. That really shaped my logging philosophy around log levels.

FATAL = The system is basically non-functional, even in some partially-working way. Alerts fire.

ERROR = Something went wrong in some part of the system. The system may be mostly or partially working, but an important thing is broken and should be fixed. Depending on the where the error is, it may need to be immediately addressed or at least very soon.

WARNING = Something isn't working right, but it's possibly transient and the system has was a handling this via retry or other workarounds. This should be addressed at some point if possible, but shouldn't require immediate intervention.

INFO = This is the log level that the system is designed to run at. It provides enough human-readable context to see how the system is running. Most importantly, it provides just enough information provide context to WARNINGS/ERRORS/FATALS that may occur somewhere in a given request. The goal is to balance information w/performance.

DEBUG = All the things, all the time. Default mode for developers, and when diagnosing issues that require much more depth of information than provided at INFO level. It is VERY RARE to run at this level in production, as INFO logs should provide enough request context for a developer to reproduce most issues, but can be enabled for short bursts in production on one server in a cluster, or some other limited form to capture necessary information.

A guiding philosophy on all logs is to persist them in a central location (and locally w/rollover just in case persistence is broken), and is to flow some request or context ID through the system so logs from a single request can be easily identified and related log statements correlated with each other.

At my last startup, it was almost a year effort to clean up the logs. Such things like app pool crash on startup were logged at DEBUG, and the application starting successfully was logged at ERROR. Combined with an effort to improve overall quality, hit a first milestone of going 3 months without the ops team requiring escalation to developers outside of business hours (i.e. they could identify and fix things based on logs). Eventually, out of business hours alerts because a "I don't remember the last one" thing.


I liked this idea in principle but in practice I found there's just too many debugPrint() lines cluttering the code, and too much overwhelming console spam, so it's not worth it to keep around debug logging (or taking the time to decide what is worthwhile to keep).

When I fix a problem gnarly enough to require lots of logging everywhere, once I've tidied up I write a detailed comment or other documentation, and/or use unit tests to make sure the problem stays solved. The logging code doesn't get committed, which also helps keep the fixed code clear in the diff.

For the kinds of programs I usually write, I've settled on only using printing/logging for major events and for things that aren't errors but might require developer attention.


> in practice I found there's just too many debugPrint() lines cluttering the code

On this particular aspect - I find that log lines can do double-duty as code comments. When I see some comment that explains the what-and-why of a section of code - "// We also need to check the Foo state for consistency before committing Bar" - I might actually change it to a log line like - "Verifying state of Foo ${foo.id} before committing Bar ${bar.id}".


> overwhelming console spam

The article says to use logging and not actual printf debugging, which solves this. If your logging framework is even somewhat capable and modern it understands all of:

- named priorities: I want that FATAL log highlighted, but I probably don't want the DEBUG log shown at all unless I am currently debugging this

- automatic categories: os.mutex.internal and account.forex.hack can be automatically distinguished in most modern languages by the fact they live in different objects / source code files / even in C we know what the name of the current function is and can at least annotate that in logs.

- how to send the log messages somewhere useful, hopefully at least files and some primitive syslog type protocol, maybe a lot more

In a nice shiny modern logging framework I'd expect to be provided with default annotations so that e.g. when I error.log("Can't enrol students on a module that doesn't exist") the framework annotates the log entry with: The HTTP request that got us here, including e.g. query parameters and date-time stamp, and the Module we were trying to call a method on when it blew up.


You can often instrument code at build time to facilitate tracing, keeping your code free from much of the logging statements while still providing logs on function calls, arguments and return values. Obviously this (usually) isn't statement-level logging, but its often enough to diagnose issues.


Yeah but... aren't you tempted to leave the debugging code, in case you might need it again later?


To me it seems like if it's really that important, it becomes error checking or assertion code, or the code should be refactored to better enforce invariants, or the debug code is spun off into unit tests. That way whatever problem I was trying to detect is checked automatically in the future. And I admit an aesthetic preference for less logging code scattered about.

But I haven't found it very useful to barf lots of data into a debug log and then hope to find something useful sifting through it when there's a problem. Maybe it's just one too many information-dense things to juggle for me and smarter people have more luck with this approach. I'm not totally against the idea, I just haven't found a way that works for me.


This is where you need to get creative: save the log as HTML, and use colors, tables, etc. to organize the data, catch error needles amongst the data haystack, etc.

I find that having lots of trace data is useful because at a glance, you can compare problem situations against normal profiles and patterns.


A lot of logging practices do seem to orbit the idea that we should be able to handle this in a more automated fashion - i.e. we're missing a suitable zero-cost/minimal cost abstraction.

In an ideal world, without restart, an application would let me selectively switch on logging levels from "errors only" up to "line-level original source code in a targeted function" - with a trigger system.

Functionally basically a debugger, but coded into the binary so we don't pay the debugger cost (and also so we can try it more broadly across a cluster). This is the direction that Linux kernel eBPF debugging is really leading us, and what it should be easier to get that sort of power into our applications at a code-size but not speed cost - even if it means having our compilers just include multiple copies of the function with different logging-level functionality enabled.


I messed up pretty badly when I learned IT. To be fair, the Army isn't a great IT teacher, but I started my scripting life with Powershell (gasp) and worse, Powershell Integrated Scripting Environment (double gasp). They had an awesome thing called the "Powershell scripting games" and I managed to convince the government agency I was working for that counted as work for a week or two. Downsides of powershell aside, it did jumpstart me into coding...

However, one of the horrible habits it gave me that I absolutely adore now is having an interactive REPL for everything. If I want to understand anything inside my code, I can run just that little tiny bit, spit out the answers I need, and move forward. Half the time the best part was just having an open powershell session connected to the same env as the script. For me, it was the first time experiencing scripting with almost no downtime between iterations (most of the time) and the ability to inspect any/every object in play mostly seemessely. I moved on from Powershell to Python/Ruby, but I've basically gone all in on Ruby (devops, no one cares what I write code in) because I am just so much more productive with Pry and the other Ruby REPL options.

It seems to come back to bite me fairly often, but despite trying every ~12 months or so to learn a better way to work with the code, I never seem to find anything that sticks consistently. CodeRunner/random extensions in VS Code, REPLit.com, all the IDE's I could think of....no joy. REPLit comes extremely close, but its online only and that becomes an issue almost immediately. Still, REPLit is amazing when you just get pissed at a random python script that's spitting out junk and you want to figure out how a specific function works.

Maybe I'm missing something? Maybe I need to learn to code the "right" way? Maybe I'm just burned out? Idk, I will literally try any IDE/Tool that hints at giving me a better REPL experience or teaches me a better way to do it.


Both LISP and Smalltalk were entirely based around having this kind of hands-on, interactive, full development environment available not in prod as well. In some sense what you are doing was the future people thought we would have, and it is a regression in some aspects that most production software runs in this bare bones system where someone can't stop it at an arbitrary point, and then have all the tools originally used to create that software available to them to observe the system.


Nothing particularly wrong with Powershell, esp. given how much it's used with MS infrastructure; it's a high-value skill in IT shops everywhere, including the DOD's own. Nothing wrong with REPL's either.

The original author's point about debug logging remains, however.

These are all tools in your toolbox that fit different circumstances differently. Challenge yourself to learn new tools.


> REPLit is amazing when you just get pissed at a random python script that's spitting out junk and you want to figure out how a specific function works

How does Replit in particular help with this (vs anything else that lets you run Python, including /usr/bin/python?)


Look into Jupyter notebooks (or proprietary alternatives), for all the beauty of iterative coding, but in a better-than-powershell language.


Yes, debug logging is a great way to encode educated guesses about where unexpected behavior will appear in the future, in a way that should save time on net when a guess is correct.

But one should be aware of its limits. Debugging often involves performing something like a binary search over a collection of suspicious assumptions. The first few well-placed debug-logging messages can save several iterations in these binary searches. But the cost of saving one more binary search iteration in this manner increases exponentially. As with so many other things in engineering, moderation is a virtue; you want good debug-logging and an ability to perform quick manual tweaks to the code to cover the "last mile" of a debugging session, when this is at all possible in your execution environment.


Logging vs debugging is like hammer vs wrench.

They are different tools for different use cases. Neither is better or worse. You can read the comments from detractors and proponents of the two tools to understand when to use them.


I seldom use debuggers in the stereotypical way anymore. About the only time I use them is when I'm stepping through the code with some other person.

Dynamic instrumentation is key to things that don't scale naively, and surprisingly reusable (if left intact).

Debug logging is a baseline. If it's too verbose for monitoring a certain task, I've become a fan of counters being incremented in e.g. loops or other code points and printing them out at the end: the counts (especially zeros or bigger than anything else), their ratios and which paths were even run often turn out to be fruitful for statistical analysis when oddities are seen and this is especially true with any kind of data munging.

For services which run continually, have configurable printing for statistics. Or have a console which can examine dynamic structures. Or both.

Last winter I couldn't wrap my head around something Zeek was doing, so I ran it under Frida.

Any time I find myself using a debugger in the normal fashion, I find that I ask myself "should you be writing a test?" and not everybody can write tests at that level. In fact, I think running an interactive debugger is a good way to waste resources when you don't wanna write a test. I don't have much sympathy.


when outputting lots of information (like a running game) i found that a "csv-like" format is actually the best compromise between readability and being structured. Just log "header_1;header2;... value_1;value_2;..." you can still grep it, or redirect to a file and parse later


in Javascript applications, you can use `console.table()`: https://developer.mozilla.org/en-US/docs/Web/API/console/tab...


I can't believe I've never heard of this! Thanks


We've settled on RFC5424 syslog messages with structured data elements (https://datatracker.ietf.org/doc/html/rfc5424#section-6.3). You get a message, plus a bunch of key-value pairs of additional info:

<14>1 2022-05-06T21:01:24.54493Z web1.floren.lan webserver 20000066 webserver/search.go:188 [gw@1 searchid="13094420576" uid="2" user="john" query="tag=gravwell" start="2022-05-06T20:01:24Z" end="2022-05-06T21:01:24Z" background="false"] Search launched


I wrote about something similar many years ago, “Traces vs. Snapshots: Print Statements and Debuggers”, https://www.scott-a-s.com/traces-vs-snapshots/


I've been developing software for almost 3 decades (if you count the C programs I wrote in high school... longer if you consider Basic!) I can count on one hand the times I've needed to use a debugger to debug something. I prefer my cave-man logging.

I do find that I use Python's built-in debugger often during development, mostly to inspect objects as I'm writing code. I'll stick a breakpoint() in my script after an API call. It's often quicker than looking at Boto or Amazon docs, for example. But I never use it for actually debugging.


Ask a Google Software Engineer about debuggers. (they don't exist and printf debugging...aka logging ...is how debugging is done there).


The thing I like about debug logging is that it forces you to be more deliberate about what you are doing. It also provides good visualization of the program state over time. Plus, you may discover places that really should have permanent logging and improve the production code.


Ever since learning programming, I wouldn't write programs the debugger wouldn't attach to...


Here's a more-hands on take on this topic for those interested: https://sequoia.makes.software/lets-code-it-the-debug-module...


You can't debug a program that ran in the past, but you can read its info-level logs.


You can debug programs that ran in the past using debuggers like rr[0], which support both recording execution for later debugging, or stepping backward in a running process.

[0]: https://rr-project.org/


I'd go even further and say you can't debug a program that's running somewhere else, but you can read it's logs (in pretty much real-time if need be).


An idea that I've been kicking around is to, in Python codebases, have a @log decorator that first logs the name of the function, all of the parameters it's called with and a function call id, then the name of the function, its return value, and the same function call id. This can be applied to as many of the functions in the codebase as necessary to provide "replayability". Haven't got the chance to try it out yet.


    import logging
    import threading

    logger = logging.getLogger(__name__)

    def log(fn):
        def log_inner(*a, **kw):
           logger.info('%s(%s, %s) @ %s', fn, a, kw, threading.get_native_id())
           return fn(*a, **kw)
        return log_inner
It's pretty easy to grab all the arguments and thread, but grabbing the return would depend on you not mutating anything permanent, so that probably needs some special casing.


Thanks for typing that up! The test will be whether this will help as much as I think it would in a real system.


Depends on the program.

In critical systems, it's not unheard of to transaction log every input and every response to external requests, have strictly deterministic code, so that you specifically can replay or reconstruct a system state.

It's a pretty demanding way to build software, but you also get extremely resilient software as a result.


You can do this in Visual Studio, it's called Time Travel Debugging.


> Instead, call your syslog/OSLog/logger function, with an appropriate severity level (probably DEBUG) and some easily filterable preamble

Good patterns for JS here? Maybe console.debug() or console.warn()?


Not sure what the best practice is, but probably you don't want console.debug -- IIRC debug is logged by default, so unless you strip console logs in prod (which defeats the purpose), your users will have their console spammed.


Replay.io lets you add print statements after the fact, without having to "attach a debugger", modify your existing code or different layers of logging.


> without having to "attach a debugger"

I can't see how a thing that is basically a debugger except it remembers the program state after each statement, can possibly work without "attaching" it, or running the program under the debugger, like you would with any other debugger.

It's not like I can say, here, I've hit an error and now I want the history of the program's execution that was never recorded.


From reading their docs, they require the program to run with their debugger. Their debugger is a custom runtime for a given language or browser that's been compiled with special debugging bits.


The best part of this is that you get the credit for speeding up the application when you remove the extraneous debug printing!


Debugger? Logging? LUXURY! In my day we had ONE BLINKING LED and considered ourselves lucky!


Often logging can help guide a a debugger driven investigation.


>There are still many situations where it’s not feasible to stop a process, attach the debugger

For example when the person trying to fix it is not a programmer. Human-readable logs are immensely useful for system administrators. Note the "human-readable" - don't dump some random binary garbage or flag values; log something that's actually useful for getting a high-level overview of what's happening.

For example, if you look at FreeBSD's "iscsid -d" output, you'll see this:

  scsid: waiting for request from the kernel
  iscsid: not forking due to -d flag; will exit after servicing a single request
  iscsid: 127.0.0.1: global login_timeout at 60 sec
  iscsid: 127.0.0.1: connecting to 127.0.0.1
  iscsid: 127.0.0.1: Capsicum capability mode enabled
  iscsid: 127.0.0.1: fetching limits from the kernel
  iscsid: 127.0.0.1: setting session timeout to 60 seconds
  iscsid: 127.0.0.1: beginning Login phase; sending Login PDU
  iscsid: 127.0.0.1: key to send: "AuthMethod=None"
  iscsid: 127.0.0.1: key to send: "InitiatorName=iqn.1994-09.org.freebsd:v3"
  iscsid: 127.0.0.1: key to send: "SessionType=Discovery"
  iscsid: 127.0.0.1: key received: "AuthMethod=None"
  iscsid: 127.0.0.1: target requested transition to operational parameter negotiation
  iscsid: 127.0.0.1: beginning operational parameter negotiation
  iscsid: 127.0.0.1: Limits for offload "" are MaxRecvDataSegment=262144, max_send_dsl=262144, MaxBurstLength=1048576, FirstBurstLength=1048576
  iscsid: 127.0.0.1: key to send: "HeaderDigest=None"
  iscsid: 127.0.0.1: key to send: "DataDigest=None"
  iscsid: 127.0.0.1: key to send: "MaxRecvDataSegmentLength=262144"
  iscsid: 127.0.0.1: key to send: "DefaultTime2Wait=0"
  iscsid: 127.0.0.1: key to send: "DefaultTime2Retain=0"
  iscsid: 127.0.0.1: key to send: "ErrorRecoveryLevel=0"
  iscsid: 127.0.0.1: key received: "HeaderDigest=None"
  iscsid: 127.0.0.1: key received: "DataDigest=None"
  iscsid: 127.0.0.1: key received: "DefaultTime2Wait=0"
  iscsid: 127.0.0.1: key received: "DefaultTime2Retain=0"
  iscsid: 127.0.0.1: key received: "ErrorRecoveryLevel=0"
  iscsid: 127.0.0.1: key received: "MaxRecvDataSegmentLength=131072"
  iscsid: 127.0.0.1: target prefers not to do header digest; we'll comply
  iscsid: 127.0.0.1: target prefers not to do data digest; we'll comply
  iscsid: 127.0.0.1: operational parameter negotiation done; transitioning to Full Feature phase
  iscsid: 127.0.0.1: beginning discovery session
  iscsid: 127.0.0.1: key to send: "SendTargets=All"
  iscsid: 127.0.0.1: waiting for Text Response
  iscsid: 127.0.0.1: key received: "TargetName=iqn.2012-06.com.example:target0"
  iscsid: 127.0.0.1: key received: "TargetAddress=127.0.0.1:3260,257"
  iscsid: 127.0.0.1: adding target iqn.2012-06.com.example:target0
  iscsid: 127.0.0.1: removing temporary discovery session
  iscsid: 127.0.0.1: discovery done; logging out
  iscsid: 127.0.0.1: waiting for Logout Response
  iscsid: 127.0.0.1: discovery session done
  iscsid: 127.0.0.1: nothing more to do; exiting
If you (roughly) know how iSCSI works, this will tell you a lot, even if you're not a programmer at all.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: