A systematic approach to debugging

AlbertCory · on Sept 11, 2023

> Reproduce the bug.

Usually if you can do that, the bug's as good as solved. It's the ones you can't reproduce that take all the time. He does explain how difficult that can be.

Also, one of mine: "The hardest bug is where nothing happens." If something goes wrong, at least you can start from that. When nothing happens, you don't even know where it was supposed to.

BurningFrog · on Sept 12, 2023

> Usually if you can do that, the bug's as good as solved

Which is why you should try hard to do it :) A lot of people... do other things instead.

Maybe the worst is trying to think up a solution.

My book about debugging would be called "Stop Thinking and Start Looking!", if I wrote it.

jahewson · on Sept 12, 2023

This is something that never gets old to me - seeing other experienced devs playing guessing games instead of instrumenting the problem thing and walking through it step by step.

That’s not to say that you can’t guess the first, or second place to look, but that’s where it should end.

The second thing I see often from smart devs is that they find some bug, assume it’s the one they came looking for and declare victory.

The third is the worse by far, which is to blame the cloud provider/vendor for all issues, especially you not reading their documentation.

AlbertCory · on Sept 12, 2023

Somewhere in that hierarchy is:

Preemptively give up and say, "We can't reproduce that."

francois14 · on Sept 12, 2023

We work with the same contractors then.

funcDropShadow · on Sept 12, 2023

David Agans' book "Debugging" has a chapter called "Quit Thinking and Look". It is quite old by IT standards, but I cannot recommend it enough.

BurningFrog · on Sept 12, 2023

Wow!

I've never seen or heard anyone else express that idea before!

I guess I don't have to write that book then :)

AlbertCory · on Sept 12, 2023

I know this topic is pretty exhausted, but hey...

In the terrible movie The Internship they have this "contest" for the interns, where they're all told of a server that has suddenly stopped serving up digital photos.

So they all start writing on a glass wall. No intern says, "hey, let's go and look at the server logs!" Not cinematic, I guess.

kqr · on Sept 12, 2023

More generally, if people would go and see for themselves more often a lot would be better in the world. Mere observation is surprisingly powerful if you take the time to do it with an open mind.

hinkley · on Sept 11, 2023

"Nothing happens" is one of the legs in the argument of libraries over frameworks.

Frameworks are notoriously bad at doing nothing and saying nothing if you don't have your inputs just so. And then you're just staring at config files for an hour hoping you will achieve satori instead of rage.

AlbertCory · on Sept 12, 2023

while we're on the subject:

Learning a UI is especially bad that way. You double click, select a menu, etc., and nothing at all happens.

bfLives · on Sept 12, 2023

Greyed-out buttons are similarly terrible UX. Far too often, it’s impossible to figure out why a button is un-clickable.

eitland · on Sept 12, 2023

Possibly even worse:

When the UX designer decided that greyed out buttons are terrible UX and bans them, so instead of disabling them (greying them out) you have to remove them if they aren't clickable.

Now, instead of being puzzled by why he can't click the button the user doesn't even know there is a button and has to start on by solving another puzzle first before solving the puzzle about how to enable the button.

PS: back in the day this could be solved by a tool tip or a paragraph in the help file. Today it seems such helpers are forbidden too.

hinkley · on Sept 12, 2023

This here is why I think Bamboo is the worst CI tool ever written. Information hiding in a dev tool? Fuck everyone involved in that decision process.

I’ve used Cruisecontrol, which being first is uniformly awful by modern standards. But not lying about features awful. And spec files have partially vindicated some of their decisions. But only partially.

IIsi50MHz · on Sept 13, 2023

There used to be not-well-liked (by designers, or users) help system, which when the user toggled it on, would display a description of purpose what you pointed at, and — when very well implemented — an explanation of why an disabled. It was called Balloon Help.

Unfortunately, it had no idea what you already knew, and would continue popping up bubbles until you turned it off. Of course, there was a fair chance that within seconds of turning it off, you would want info on something again. So, you'd mouse over to small global icon for it, pull down a menu, choose enable, and then it would gleefully show balloons for much of what your pointer passed over on your way to the one thing yow needed help with…and again on the way back to the menu toggle to shut it off.

But eventually, some kind person invented a system hack that made it turn on for just the duration that you held a modifier key, and it was glorious. For users, at least. For programmers, it was still annoying, because you were supposed to create all those descriptions: one for each enable control or interesting element and, to do it Properly™ one for each possible reason that the given widget was disabled.

Most users never got to experience the modifier-key method. Most developers, if they implemented it, only provided the item descriptions, without bothering with the disabled state.

Microsoft decided to partially imitate Balloon Help by providing, and asking third parties to provide, a tiny question-mark button in each dialog box. Clicking it let you then click any one item to get a description of it. Which was annoying when you wanted to know about multiple items. And more annoying when many or all of the items you clicked didn't have descriptions. If you weren't in a dialog window, yow were supposed to use the program's Help option, which was completely different.

Both Balloon Help and question-click-help eventually were replaced by that unfortunate idea: a smallish floating window that is simultaneously too small to show the content (often having scrollbars in both directions) and too large to get out of your way. With seemingly nothing satisfying available, help systems continue to diverge, offering varying degrees of utility like local HTML that opens in a web browser (often slow to launch), actual online web pages (useless if not connected), sidebars that resize your workspace and open sluggishly, and animated talking characters that won't accept "Go away!" for an answer (::shudder::)…

On the whole, I'd take something like "Balloon Help with modifier key", except for when a tutorial would be more appropriate.

…

But back on topic of grey buttons, other terrible-but-trending designs are "everything is light/medium grey" (makes everything look disabled), "one or more enabled elements are fancy/coloured while one or more equally enabled elements are grey" (often used when the greyed options are "Obviously what any sane user would want, but that negatively affects some target metric").

ianmcgowan · on Sept 11, 2023

Another see also, a book I used to hand out to developers, support teams and DBA's: http://debuggingrules.com/. It's not CS specific, or even really computers, but the mindset is invaluable. It's also got some engaging war stories to keep you reading :_)

(or for a summary of just the rules) https://courses.cs.washington.edu/courses/cse474/18wi/pdfs/l...

hinkley · on Sept 12, 2023

4&5 is too slow by half, probably more.

As the number of potential culprits goes up the odds that you will identify the correct one first are very small. Most of what makes these processes slow is getting emotionally invested in solving the problem in as few 'moves' as possible. Occasionally that will work for you but it's not a safe long-term strategy.

What you want to do instead is to formulate several hypotheses, and then figure out the relative likelihood and difficulty in testing each one. Determine if partial tests of one hypothesis also satisfy or invalidates parts of others, and then begin the process of winnowing down possibilities in a calm and collected process of elimination, balancing cost of validating an assumption versus how much of the problem space it eliminates.

Once you find the bug you still have to fix it, and getting worked up makes for cures that are as bad as the disease.

lelanthran · on Sept 12, 2023

> Determine if partial tests of one hypothesis also satisfy or invalidates parts of others, and then begin the process of winnowing down possibilities in a calm and collected process of elimination,

Think of it like Wordle - with each guess you try to:

1. Use as many new letters that you can (so you can eliminate them),

2. Use all the almost-correct (yellow) letters in new positions,

3. Keep all the correct (green) letters in the same position.

So each guess is a test, where you try to invalidate as many letters as possible and satisfy the existing proven letters.

hinkley · on Sept 13, 2023

Yep, and sometimes you just test things because you know Steve won't let it drop until someone does. Eliminate the zebra just so people stop thinking about zebras.

rectang · on Sept 11, 2023

> My approach is systematic and focused on understanding first and foremost. This is for a variety of reasons, but principally that you need to understand what is going on both to fix it and to be sure it's fixed.

Feature-focused management often fails to take comprehensibility into account. Initial development hours are rationed carefully, but debugging gets an infinite budget, as if it exists outside the reality of delivering value.

Eventually is a tradeoff between the amount of time you invest in making the system comprehensible and the amount of time saved in debugging. But that's only after you've thrown down table stakes (unit tests, basic documentation, high-level modularization with loose coupling, preservation of critical invariants, etc.)

jonahx · on Sept 12, 2023

> Eventually is a tradeoff between the amount of time you invest in making the system comprehensible and the amount of time saved in debugging.

Well said.

> But that's only after you've thrown down table stakes (unit tests, basic documentation, high-level modularization with loose coupling, preservation of critical invariants, etc.)

I would actually turn this on its head, because I would rather have a comprehensible system without those table stakes features than an incomprehensible system with them. Of course, there is a feedback cycle between unit tests/docs/invariants and comprehensibility, but I have worked in systems with pretty good tests that were still very slow to work in because of their complexity.

Imo everything comes second to comprehensibility. Without it you are hosed.

vegetablepotpie · on Sept 11, 2023

I really like this document. This is the process I have settled on to debug issues, and I have found this approach to be universal.

Having such a finely articulated process has also been helpful to demonstrating to me why micromanagement in the business world completely destroys productivity.

Management hates, hates, hates this process to the depths of their souls because they would prefer a prescribed process that outlines well defined time-boxed steps to be articulated, from the get-go for many diverse disciplines to contribute to equally.

To them, Steps 1-3 should already by known and are unnecessary. Needing to go through them either means the issue is unimportant, the developer is incompetent, or there isn’t alignment between teams.

Step 4 and 5 means you’re flailing. You need to provide action, not ideas. “narrowing the search space” means assigning blame; healthy teams work together, they don’t assign blame. Developers attempting to break down the problem are disrupting the harmony of the work environment. This necessary step should be avoided.

Step 6 is the only step that matters. It’s more important to get to this step as quickly as possible ill informed than to arrive at it with data to support your conclusions.

Following these steps are strictly necessary in many circumstances. They’re also impossible to follow under scrutiny.

I say this to make a recommendation: Any good leader, to a development team, should insulate and shield their team from scrutiny so that they can carry these steps out rigorously rather than succumb to pressure and micromanage productivity into oblivion.

onetimeuse92304 · on Sept 12, 2023

Some important things missing.

One has to define what a bug is. For example, I have many times received a bug report for something that was perfectly normal system behaviour. The reasons for this are varied from users not understanding the system, the system behaviour being complex and causing emergent phenomena that is strange at first sight but completely fine when you start to analysing it to consequences of underspecified requirements, broken communication at the time of requirement gathering, etc.

It might still result in a bug, but this time a bug report for the original design.

Another very important thing missing is it is absolutely crucial to gather from the user what was the expected outcome. I can't begin to tell you how many times I had a developer "fix" the bug only to find out the fix is not what the user wanted to happen. Or that there wasn't even a bug in the first place and the fix actually broke the system for somebody else.

As to the forming and testing the hypothesis, the missing is that you can do it all by reading, understanding the code and simulating the problem all in your mind. I have probably fixed something like 70-80% of all bugs at this stage.

What you do is you form a hypothesis and then you start tracing the pieces of data through the code of the application (aided by your understading of it). If I can find something in the code that disproves my hypothesis, I have now saved time on setting up the whole test. But I also gained more understanding of the problem which I would probably not if I just let the application fail. Even if I actually found a solution, I can have better understanding how the piece of data interacts with parts of the system and I may now be aware how to write the fix or at the very least how to do it better.

philbo · on Sept 12, 2023

I like to have a step zero, that comes even before you try to reproduce the issue: get your head right, essentially.

If something's broken in production and the bug is hard to figure out, that often means stress. And stress is not good for helping you to solve problems. So take a few breaths, slow down, relax.

Slowing down to get into the right frame of mind helps get to the _right_ fix faster, because your panicked brain is not jumping to all kinds of speculative conclusions about what the solution might be. Be calm, observe, reflect.

Coincidentally, I wrote a blog post about this and my other steps for debugging not long ago:

https://philbooth.me/blog/how-to-debug

gniv · on Sept 12, 2023

This is great as a technical approach, but I would argue that sometimes it's not the right way. Debugging a difficult bug alone takes a lot of time and can be frustrating, especially if you're not familiar with the code base. I found that broadcasting about the bug to the extended team (if the team culture permits, of course) is more efficient. Somebody might understand that part of the project so well that they can tell you exactly where to look to solve it. Also, multiple people thinking a few moments about it and giving educated guesses can make a big difference.

kmoser · on Sept 12, 2023

> What were the recent changes?

If you narrow your search to most recent changes, you have a really good chance of finding the source of the bug. The culprit may even be you!

navane · on Sept 12, 2023

The one step I miss: what reasoning or assumption lead to this bug, and where could that have been repeated. This way you have a chance to sniff out bugs before they surface.

boberoni · on Sept 11, 2023

This blog post should be turned into a GitHub issue template.

skipkey · on Sept 11, 2023

The one thing I would add is, if at all possible, write a unit test that fails, reproducing the bug. I have found when working on a code base that is only lightly unit tested, it’s one of the best ways to increase coverage.

Brystephor · on Sept 11, 2023

100%. In general, if you're submitting a PR for a bug fix then an automated test (ideally a unit test) should be included in the PR that tests the bug fix.

syndicatedjelly · on Sept 12, 2023

Really well-written, and helps head off the anxiety some people feel when a new bug report rolls in.

082349872349872 · on Sept 11, 2023

see also https://lobste.rs/s/4iwbak/systematic_approach_debugging#c_s...