
How to Debug (2010) - wheresvic1
https://blog.regehr.org/archives/199
======
lmilcin
The most difficult bug I have ever worked on was a device that
would have its flash memory erased from time to time. With a couple of million
of those in the field, only about one hundred would be affected every day.
That frequency was enough to cause a huge financial issue for the company
(bricked devices to be destroyed and replaced, unhappy customers, PR
nightmare), yet not frequent enough to be observed in a lab environment.

We had to set up a couple dozen of these doing operations 24/7 just to be able
to observe a single occurrence of the problem maybe once every week or two.

The device was built in a way that made it impossible to probe the physical
lines between the CPU and the flash chip. This was intended as a security
feature but made the whole debugging procedure extremely difficult.

The start of the problem could not be linked to any particular change in the
software.

In the end we tracked the problem to a single decision made a year before
the problem started showing up.

The decision was to use the UNLOCK BYPASS feature. UNLOCK is a special command
sent to the flash chip that effectively validates that the message was not
garbled. UNLOCK BYPASS turns off the use of this feature. This was done to
improve write performance, as the UNLOCK command slows down writes. With
UNLOCK turned off, the flash chip is more likely to interpret noise on its
lines as a valid command.

This change did not immediately cause problems. Only later did another chip on
the board start being used in a slightly different way, which generated much
more noise. With the higher noise floor, the flash chip would
occasionally execute a command that was not sent from the CPU.

The irony here is that the rules for the construction of the device required
that some of the signal lines be sandwiched in the inner layers of the PCB
between two layers of signal lines, to prevent easy access to the inner lines.
With a 4-layer PCB, this ruled out the large ground planes that would have
helped control the induced noise that was causing the issue.

Whenever we tried to replicate the problem on a set of development devices
(devices with the communication lines available for probing), the problem
would not show up due to a different layout of the PCB.

In the end debugging the problem took about half a year.

------
ubertakter
Two bug stories of limited usefulness:

Worked on a tool that used 3D trajectories in an analysis. Some of the output
looked strange, but not exactly incorrect (there was no obviously correct
answer). Looking at the trajectories used in the analysis, we started thinking
some of them had to be wrong. We isolated the "most" incorrect ones and dug
into the code. Looking at the x, y, z components pointed us to a few
functions. We found a function call with a typo. Instead of f(x,y,z) it was
called with f(x,y,y). That one took a day or two to figure out.

Working on a tool that plotted satellite trajectories as part of results
visualization, there was a strange jump in the orbit plot. Orbits before and
after were fine. We eventually narrowed down the jump to crossing a specific
date and time, and not at an obvious boundary (some day in the middle of the
year if I remember correctly). There was no obvious reason for it to occur (no
errors in our calculations or the input data). Eventually, we discovered that
a leap second had been added on that date. The libraries we were relying on
did not include that leap second since it had only been added recently, but
the input data did. That was... frustrating. If I recall correctly, leap
seconds are no longer added to time information (thankfully).
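
A minimal sketch of how a stale leap-second table produces that kind of jump. The story does not say which leap second or which libraries were involved, so the names here (`FULL_TABLE`, `STALE_TABLE`, `tai_minus_utc`, `timestamp_error`) and the choice of the mid-2015 leap second are assumptions for illustration only.

```python
from datetime import date

# TAI-UTC offsets in seconds, each effective from the given date (UTC).
# Only two entries are shown; a real table has many more.
FULL_TABLE = [(date(2012, 7, 1), 35), (date(2015, 7, 1), 36)]

# Hypothetical library table that predates the most recent leap second.
STALE_TABLE = FULL_TABLE[:-1]


def tai_minus_utc(day, table):
    """Return the TAI-UTC offset in effect on `day` according to `table`."""
    offset = None
    for effective, seconds in table:
        if day >= effective:
            offset = seconds
    if offset is None:
        raise ValueError(f"{day} is earlier than the first table entry")
    return offset


def timestamp_error(day):
    """Seconds of disagreement between input data and a stale library."""
    return tai_minus_utc(day, FULL_TABLE) - tai_minus_utc(day, STALE_TABLE)


print(timestamp_error(date(2015, 6, 30)))  # before the leap second
print(timestamp_error(date(2015, 7, 1)))   # after the leap second
```

The two sides agree up to the date of the new leap second and disagree by one second afterwards. For a satellite in low Earth orbit, which moves several kilometres per second, one second is enough to show up as a visible jump in a plot.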

------
tomxor
> 2\. Stabilize, Isolate, and Minimize

I feel this section is a bit neglected in terms of clarity and emphasis. It
touches on a lot of advanced details but loses sight of the basics important
to a novice... Anecdotally, but with more than one sample, the most common
rudimentary mistake I see is someone attempting to isolate a bug upside-down,
i.e., poking around in small portions of code that do not comprise the whole
system involved in the symptom, without any prior reason to be confident that
the portion of code is involved.

In 99% of cases, starting with the whole system involved in reliably
reproducing the symptom and then bisecting, or using some guided
divide-and-conquer technique, will get to the interesting parts of the code
fastest, and with a confidence that allows focus on elusive bugs. Yes, this is
not guaranteed to work, e.g., notoriously difficult bugs with multiple,
disparate factors or even timing issues will evade this technique, but you
would still attempt it before resorting to more challenging methods.
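
As a rough illustration of the bisecting idea, here is a minimal sketch. It assumes the symptom can be checked automatically and that once it appears it stays (everything before some point is good, everything after is bad). The names `first_bad` and `symptom_reproduces` and the revision numbers are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(items: Sequence[T], is_bad: Callable[[T], bool]) -> int:
    """Return the index of the first bad item.

    Assumes a prefix of good items followed only by bad items,
    and that the last item is known to be bad.
    """
    lo, hi = 0, len(items) - 1  # invariant: items[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(items[mid]):
            hi = mid        # first bad item is at mid or earlier
        else:
            lo = mid + 1    # first bad item is after mid
    return lo


# Hypothetical history: revisions 100..119, symptom introduced in 113.
revisions = list(range(100, 120))


def symptom_reproduces(rev: int) -> bool:
    # Stand-in for "build this revision and run the whole-system repro".
    return rev >= 113


culprit = revisions[first_bad(revisions, symptom_reproduces)]
print(culprit)
```

With 20 candidates this needs about 4 or 5 runs of the reproduction instead of up to 20, and each run exercises the whole system, so there is no guessing about which part is involved.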

Perhaps most teachers fail to emphasize this explicitly because it seems so
obvious to them; it seems implicit in the word isolate. But if I had to pick
one single concept for novices, it would be this one.

~~~
commandlinefan
It also highlights the importance of writing isolatable code. The simplest way
to do that is to write code that doesn’t use any global variables, but in my
observation this is something that even experienced developers can’t seem to
wrap their heads around. Even languages like Java that don’t allow global
variables end up simulating them with dozens of “singletons” that are
referenced everywhere and do things that make targeted troubleshooting
impossible.
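
A small sketch of the difference, with hypothetical names (`Config`, `total_with_global`, `total`). The first function depends on hidden module-level state. The second takes everything it needs as arguments, so it can be exercised on its own.

```python
from dataclasses import dataclass


@dataclass
class Config:
    tax_rate: float


# Harder to isolate: the result depends on module-level state that
# any other code may have changed before this function is called.
_GLOBAL_CONFIG = Config(tax_rate=0.25)


def total_with_global(price: float) -> float:
    return price * (1 + _GLOBAL_CONFIG.tax_rate)


# Easier to isolate: every input is visible in the signature.
def total(price: float, config: Config) -> float:
    return price * (1 + config.tax_rate)


# The isolated version can be checked with any configuration,
# without touching or resetting shared state.
isolated_result = total(100.0, Config(tax_rate=0.5))
print(isolated_result)
```

The same idea applies to singletons: passing the instance in explicitly keeps the dependency visible at the call site.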

------
yadaeno
One of my favorite types of bugs in C is what my professor called 'unlucky
bugs'

A bug that changing the arrangement of your code can either fix or
create. It's possible if you have a static array initialized to the wrong
length.

Changing the arrangement offsets the initialization segment of memory, which
is why this is possible.

~~~
greenyoda
The other side of this is having code that only works correctly because you've
gotten lucky. For example, the variables on the stack that you're overwriting
due to a buffer overflow are no longer being used at that point in the
function.

~~~
2rsf
And a third variation is two or more bugs that cancel each other out; for
example, you overwrite a variable but never use it (where you should have).

------
alexhutcheson
If you're interested in a deeper dive into this, Andreas Zeller's Udacity
course is excellent:

[https://www.udacity.com/course/software-debugging--cs259](https://www.udacity.com/course/software-debugging--cs259)

Despite the name, it's not a "how to drive PDB/GDB/JDB/etc." course, but
focuses on the higher level concepts of how to identify bugs and build tools
that automate the debugging process.

------
csours
One of the best debugging tools is to consider the application layers. Item 2
in the list "Stabilize, Isolate, and Minimize" touches on this, but to be more
specific: consider your system as a call-stack or as an application stack. Can
you find the last place in the system where things work as expected?
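
One way to make that question concrete, as a sketch only: run the layers in order, check each layer's output against what you expect, and report the last one that still looked right. The stage names, the `last_good_stage` helper, and the deliberately planted bug are all hypothetical.

```python
from typing import Any, Callable, List, Optional, Tuple

# (name, function that runs the layer, check on that layer's output)
Stage = Tuple[str, Callable[[Any], Any], Callable[[Any], bool]]


def last_good_stage(stages: List[Stage], data: Any) -> Optional[str]:
    """Return the name of the last stage whose output passed its check."""
    last_good = None
    for name, run, looks_right in stages:
        data = run(data)
        if not looks_right(data):
            break  # the defect is in this stage or in its inputs
        last_good = name
    return last_good


stages: List[Stage] = [
    ("parse", lambda s: [int(x) for x in s.split(",")], lambda v: v == [1, 2, 3]),
    ("scale", lambda v: [x * 10 for x in v], lambda v: v == [10, 20, 30]),
    # Planted off-by-one bug for the demonstration.
    ("sum", lambda v: sum(v) - 1, lambda n: n == 60),
    ("format", lambda n: f"total={n}", lambda s: s == "total=60"),
]

result = last_good_stage(stages, "1,2,3")
print(result)
```

Here the answer is "scale", which points at the "sum" layer as the first place to look.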

------
kaycebasques
I needed a link to something like this while writing the JavaScript debugging
tutorial for Chrome DevTools. I think I initially tried to start the tutorial
off with conceptual information like this “How To Debug” article but
eventually scrapped it because the conceptual preamble was longer than the
tutorial itself.

------
tacon
Steve Litt has a Universal Troubleshooting Process which he has been promoting
since 1996:

[http://www.troubleshooters.com/tuni.htm](http://www.troubleshooters.com/tuni.htm)

------
commandlinefan
> pretty close to working and break it worse and worse

I wish I could say the last time I did this was when I was in college...

------
ohiovr
My answer is usually in the error. If not, I made a design flaw.

~~~
ohiovr
Well it works better than osmosis..

------
enriquto
> How to Debug

Don't.

Write tests, assertions and logs to understand what happened when something
went wrong. Do not ever 'debug'; it is a waste of your time.
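
A minimal sketch of what that can look like in practice, using only the standard library. The function `apply_discount`, the logger name, and the test are hypothetical examples.

```python
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("orders")


def apply_discount(price: float, percent: float) -> float:
    # Assertion on the input: fail loudly at the point of misuse.
    assert 0 <= percent <= 100, f"percent out of range: {percent}"
    discounted = price * (1 - percent / 100)
    # Log line: records what happened for later reading.
    log.debug("apply_discount price=%s percent=%s -> %s",
              price, percent, discounted)
    # Assertion on the result: a discount must never raise the price.
    assert 0 <= discounted <= price, f"bad discounted price: {discounted}"
    return discounted


def test_apply_discount() -> None:
    assert apply_discount(200.0, 50) == 100.0
    assert apply_discount(80.0, 0) == 80.0


test_apply_discount()
```

Note that Python skips `assert` statements when run with `-O`, so checks that must always run need an explicit `if` and `raise`.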

~~~
jhayward
I have a friend who is the best programmer I know who says he uses time spent
in the debugger as a red flag that someone is bad at programming.

He writes lots of validation/logging features in his code. Not a whole lot of
unit tests, per se, that I've seen.

~~~
sjf
I believe the total opposite is true. If I can use a debugger you can bet I am
going to prefer that over log statements. In the debugger I can see the value
of any variable on the entire stack and how they change over time. If I want
to log that I would have to add _many_ log statements, trawl through the
output and then go back and remove them afterwards.

