The most difficult bug I have ever worked on was a device that would have its flash memory erased from time to time. With a couple of million of those in the field, only about one hundred would be affected every day, meaning the frequency was high enough to cause a huge financial issue for the company (bricked devices to be destroyed and replaced, unhappy customers, PR nightmare) yet not high enough for the problem to be observed in a lab environment.
We had to set up a couple dozen of these running operations 24/7 just to catch a single occurrence of the problem maybe once every week or two.
The device was built in a way that made it impossible to observe the physical lines between the CPU and the flash chip. This was intended as a security feature but made the whole debugging procedure extremely difficult.
The start of the problem could not be linked to any particular change in the software.
In the end we tracked the problem down to a single decision made a year before the problem started showing up.
The decision was to use the UNLOCK BYPASS feature. UNLOCK is a special command sequence sent to the flash chip that effectively validates that the message was not garbled. UNLOCK BYPASS turns this off. It was done to improve write performance, since the UNLOCK sequence slows down writes. With UNLOCK turned off, the flash chip is more likely to interpret noise on its lines as a valid command.
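For readers who have not worked with this kind of part: on parallel NOR flash with an AMD/Spansion-style command set, a write is a short sequence of bus cycles, and the unlock cycles are what make a random pattern on the bus unlikely to be accepted as a command. A minimal sketch of the difference; the cycle addresses, data values, base address, and polling are placeholders that vary by part and bus width:

    #define FLASH ((volatile unsigned short *)0x10000000) /* hypothetical mapping */

    /* Normal programming: every word is preceded by the two unlock cycles, so a
     * burst of noise on the bus is very unlikely to line up into a valid command. */
    static void program_word(unsigned int offset, unsigned short data)
    {
        FLASH[0x555] = 0xAA;    /* unlock cycle 1 */
        FLASH[0x2AA] = 0x55;    /* unlock cycle 2 */
        FLASH[0x555] = 0xA0;    /* program command */
        FLASH[offset] = data;
        /* ...poll status bits until the write completes... */
    }

    /* Unlock bypass mode: the unlock cycles are issued once on entry, and each
     * word afterwards takes only two bus cycles.  Faster, but a much shorter
     * noise-induced pattern can now be mistaken for a real command. */
    static void program_word_bypass(unsigned int offset, unsigned short data)
    {
        FLASH[0x000] = 0xA0;    /* program command, no unlock cycles */
        FLASH[offset] = data;
        /* ...poll status bits until the write completes... */
    }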
This change did not immediately cause problems. Only later, another chip on the board started being used in a slightly different way, which generated much more noise. With the higher noise floor, the flash chip would occasionally execute a command that was never sent by the CPU.
The irony here is that the rules for the construction of the device required that some of the signal lines be sandwiched in the inner layers of the PCB between two layers of signal lines, to prevent easy access to the inner lines. On a 4-layer PCB this precluded the large ground planes that would have helped control the induced noise causing the issue.
Whenever we tried to replicate the problem on a set of development devices (devices with the communication lines available for probing), the problem would not show up because the PCB layout was different.
In the end debugging the problem took about half a year.
Worked on a tool that used 3D trajectories in an analysis. Some of the output looked strange, but not exactly incorrect (there was no obviously correct answer). Looking at the trajectories used in the analysis, we started thinking some of them had to be wrong. We isolated the "most" incorrect ones and dug into the code. Looking at the x, y, z components pointed us to a few functions. We found a function call with a typo. Instead of f(x,y,z) it was called with f(x,y,y). That one took a day or two to figure out.
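As a hypothetical illustration of why this kind of typo is so easy to miss (the function here is made up; the call compiles cleanly and only the result is wrong):

    #include <math.h>
    #include <stdio.h>

    /* hypothetical helper standing in for the real analysis function */
    static double magnitude(double x, double y, double z)
    {
        return sqrt(x * x + y * y + z * z);
    }

    int main(void)
    {
        double x = 1.0, y = 2.0, z = 3.0;
        printf("intended: %f\n", magnitude(x, y, z));
        printf("typo:     %f\n", magnitude(x, y, y)); /* z accidentally typed as y */
        return 0;
    }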
Working on a tool that plotted satellite trajectories as part of results visualization, there was a strange jump in the orbit plot. Orbits before and after were fine. We eventually narrowed down the jump to crossing a specific date and time, and not at an obvious boundary (some day in the middle of the year if I remember correctly). There was no obvious reason for it to occur (no errors in our calculations or the input data). Eventually, we discovered that a leap second had been added on that date. The libraries we were relying on did not include that leap second since it had only been added recently, but the input data did. That was... frustrating. If I recall correctly, leap seconds are no longer added to time information (thankfully).
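A rough, self-contained sketch of the failure mode, using made-up numbers rather than the real leap-second table: if the table used to convert a continuous time scale to UTC is missing the newest leap second, every timestamp after that date comes out one second off, and data that does know about the leap second appears to jump right at that boundary.

    #include <stdio.h>

    /* (time on the continuous scale at which the leap takes effect, cumulative
     * offset); all values here are invented for illustration. */
    struct leap { long long effective; int offset; };

    static const struct leap stale_table[] = {  /* library missing the newest entry */
        { 1000000, 16 },
    };
    static const struct leap fresh_table[] = {  /* what the input data assumed */
        { 1000000, 16 },
        { 2000000, 17 },
    };

    static long long to_utc(long long t, const struct leap *tab, int n)
    {
        int off = 15;                           /* baseline offset before the table */
        for (int i = 0; i < n; i++)
            if (t >= tab[i].effective)
                off = tab[i].offset;
        return t - off;
    }

    int main(void)
    {
        long long t = 2000123;                  /* a moment after the newer leap second */
        printf("stale table: %lld, fresh table: %lld\n",
               to_utc(t, stale_table, 1), to_utc(t, fresh_table, 2));
        /* The results differ by one second - the jump seen in the orbit plot. */
        return 0;
    }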
I feel this section is a bit neglected in terms of clarity and emphasis: it touches on a lot of advanced details but loses sight of the basics important to a novice... Anecdotally, but with more than one sample, the most common rudimentary mistake I see is someone attempting to isolate a bug upside-down, i.e. poking around in small portions of code that do not comprise the whole system involved in the symptom, without any prior reason to be confident that that portion of code is involved.
In 99% of cases, starting with the whole system involved in reliably reproducing the symptom and then bisecting, or some other guided divide-and-conquer technique, will get to the interesting parts of the code fastest and with the confidence that allows focus on elusive bugs. Yes, this is not guaranteed to work; e.g. notoriously difficult bugs with multiple, disparate factors or even timing issues will evade this technique, but you would still attempt it before resorting to more challenging methods.
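A minimal sketch of the bisecting idea, assuming a deterministic reproduction; run_with_first_n() is a hypothetical harness that replays only the first n steps of the full scenario and reports whether the symptom appears:

    #include <stdio.h>

    /* Hypothetical harness: returns 1 if the symptom reproduces when only the
     * first n steps of the scenario are run.  Stubbed here so the sketch runs;
     * pretend the bug first appears once step 37 has executed. */
    static int run_with_first_n(int n)
    {
        return n >= 37;
    }

    /* Binary search for the first step count at which the symptom appears,
     * assuming it is absent with 0 steps and present with all of them. */
    static int find_first_failing_step(int total_steps)
    {
        int lo = 0, hi = total_steps;
        while (hi - lo > 1) {
            int mid = lo + (hi - lo) / 2;
            if (run_with_first_n(mid))
                hi = mid;
            else
                lo = mid;
        }
        return hi;
    }

    int main(void)
    {
        printf("first failing step: %d\n", find_first_failing_step(100));
        return 0;
    }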
Perhaps most teachers miss explicitly emphasizing this because it seems so obvious to them; it seems implicit in the word "isolate". But if I had to pick one single concept for novices, it would be this one.
It also highlights the importance of writing isolatable code. The simplest way to do that is to write code that doesn't use any global variables, but in my observation this is something that even experienced developers can't seem to wrap their heads around. Even languages like Java that don't allow global variables end up having them simulated with dozens of "singletons" that are referenced everywhere and that make targeted troubleshooting impossible.
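A small sketch of the difference (the names here are made up): the first function can only be exercised by first arranging the global it reads, while the second can be tested or stepped through in complete isolation.

    /* Hard to isolate: behaviour depends on global state set up somewhere else. */
    static int g_tax_rate_percent = 20;

    int price_with_tax_global(int price)
    {
        return price + price * g_tax_rate_percent / 100;
    }

    /* Easy to isolate: everything the function depends on arrives as a parameter,
     * so a test or a debugging session can feed it any inputs it likes. */
    int price_with_tax(int price, int tax_rate_percent)
    {
        return price + price * tax_rate_percent / 100;
    }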
One of my favorite types of bugs in C is what my professor called 'unlucky bugs':
A bug that rearranging your code can fix or create. It's possible, for example, if you have a static array initialized to the wrong length.
Changing the arrangement shifts the layout of the initialized segment of memory, which is why this is possible.
The other side of this is having code that only works correctly because you've gotten lucky. For example, the variables on the stack that you're overwriting due to a buffer overflow are no longer being used at that point in the function.
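A hypothetical sketch of both sides at once: the loop below writes one element past the end of buf, and whether anything visibly breaks depends on what the compiler happened to place next to it, which is exactly why rearranging unrelated code can create or "fix" the bug.

    #include <stdio.h>

    int main(void)
    {
        int canary = 42;                  /* may or may not sit right after buf */
        int buf[4];

        for (int i = 0; i <= 4; i++)      /* off-by-one: also writes buf[4] */
            buf[i] = 0;

        /* Undefined behaviour: depending on stack layout this prints 42 ("lucky")
         * or 0 ("unlucky"), or the compiler may do something else entirely. */
        printf("canary = %d\n", canary);
        return 0;
    }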
Despite the name, it's not a "how to drive PDB/GDB/JDB/etc." course, but focuses on the higher level concepts of how to identify bugs and build tools that automate the debugging process.
One of the best debugging techniques is to consider the application layers. Item 2 in the list "Stabilize, Isolate, and Minimize" touches on this, but to be more specific: consider your system as a call stack or as an application stack. Can you find the last place in the system where things work as expected?
I needed a link to something like this while writing the JavaScript debugging tutorial for Chrome DevTools. I think I initially tried to start the tutorial off with conceptual information like this “How To Debug” article but eventually scrapped it because the conceptual preamble was longer than the tutorial itself.
Eh, what an ignorant point of view, sorry to be blunt.
Tests and assertions can only protect you from bugs you already assumed could happen, and they can protect you from the same bug happening again under exactly the same circumstances. They are completely useless for unexpected, unreproducible bugs which only happen when the stars align the right way.
Logging is a bit more useful but has some of the same problems: you only know what to log when you know what bugs to expect.
I mean using a debugger to step through your code.
For many years I was a fan of debuggers. I thought you couldn't realistically program without a debugger, and I mocked people who used print to look at the values of variables at runtime (which is just a primitive form of logging). Then a friend of mine, who is a much better programmer, told me he never uses a debugger. I do not really remember his arguments, something philosophical like the work spent inside the debugger leaves no trace and cannot be reproduced automatically; or that the runtime itself already steps through your code, no need to re-do that from elsewhere. Whatever, out of respect for the expertise of my friend I went debug-free and I could not be happier.
Maybe the only context where I couldn't do without debugging is assembly programming. But it's been a few years since I last did that.
Your reasoning is pretty weak, to say the least. Mostly just that your friend doesn't like debuggers. Hardly enough to make the claim: "Do not ever 'debug', it is losing your time."
Now to the original point: since when is it _either_ debugging _or_ logging? Both are entirely compatible. Secondly, I could say (and firmly believe) that logging is just a primitive version of debugging. I mean, with logging you're stuck looking at the values you thought would be interesting when you looked at the code, whereas with a debugger you are not limited by those previous assumptions.
> I do not really remember his arguments, something philosophical like the work spent inside the debugger leaves no trace and cannot be reproduced automatically; or that the runtime itself already steps through your code, no need to re-do that from elsewhere.
Work spent in a debugger is just as reproducible as work spent looking at logs. It's time you spent looking at information and thinking. Ignoring the fact that the two methods are in fact complementary, a debugger is like looking at logs, but with the ability to dynamically extend those same logs. I don't have any idea what you mean by the runtime not needing to be re-done. I use logging and I use debugging. I often refine my logging based upon my experience. Some of that experience comes from looking at logs and errors, and some comes from debugging and looking at things that way.
I'm hesitant to try to summarize my philosophy in a sentence, but I guess it would be that stack traces/logs suffice when the execution state is of relatively low complexity relative to the task at hand (say the issue is basically the result of a bug fairly close to the problem, or at least logically cleanly related), whereas a debugger is appropriate if the state is of relatively large importance (since what you'd want then is a "log" of the current execution state, which is basically a debugger).
TLDR: Use whatever makes you efficient at the task at hand, just as with everything else. There is no one tool to rule them all.
That statement assumes that I wrote what I'm working on.
There are more important things to do than to carefully comb over or instrument thousands upon thousands of lines of other people's work (if that's even an option; it's very often impossible, e.g. with closed libraries) for the sake of avoiding one of the most useful tools a developer has. When you're using a debugger, the number of assumptions that you have to make about what's happening drops to almost zero. If you are making no assumptions without a debugger or ridiculous levels of instrumentation, the software you are working on is of trivial size, full stop.
It is not really an argument, but an adage. And I like it as a general governing principle. Sometimes I have to resort to debugging, but thanks to this adage that act is regarded as a momentary shameful failure. Trying to avoid this shame is a powerful motivation to code carefully.
I believe the total opposite is true. If I can use a debugger you can bet I am going to prefer that over log statements. In the debugger I can see the value of any variable on the entire stack and how they change over time. If I want to log that I would have to add _many_ log statements, trawl through the output and then go back and remove them afterwards.
Eh, in my experience "time spent in the debugger" correlates positively with how good someone is at programming, because the weakest programmers I have worked with have no idea how to drive a debugger and never use one. In single-threaded environments (like most unit tests) I think a debugger is an enormous productivity enhancer.
I think "time spent quickly making minor changes, re-compiling, and re-running the code" is a better indicator of weak debugging skills, but that still doesn't mean they are a "bad programmer". Often all they need is someone with stronger debugging skills to walk through an example debugging session with them, and teach them what techniques and tools they would use to approach the problem.