

Slow bugs - kabouseng
http://www.embeddedfool.net/

======
blrgeek
Heisenbugs: when you test for them, they don't appear anymore. (Add a printf,
et voila, the race condition is gone.) I've personally experienced this one: a
race condition/deadlock between 3 different threads. One mutex was inside the
debug function :)

[http://www.catb.org/jargon/html/H/heisenbug.html](http://www.catb.org/jargon/html/H/heisenbug.html)
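A minimal sketch of how a lock hidden inside a debug function can make a race
disappear, in the spirit of the story above (the counter, thread count, and
helper names are all illustrative, not from the original bug):

```python
import threading

N = 100_000
counter = 0
debug_lock = threading.Lock()

def racy_increment():
    """Unlocked read-modify-write: updates from the other thread can be lost."""
    global counter
    for _ in range(N):
        tmp = counter          # read
        counter = tmp + 1      # write; a concurrent update may be overwritten

def logged_increment():
    """Same update, but routed through a 'debug' lock that serializes it."""
    global counter
    for _ in range(N):
        with debug_lock:       # the mutex hidden inside the debug function
            tmp = counter
            counter = tmp + 1  # now atomic w.r.t. the other thread

def run(worker):
    global counter
    counter = 0
    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads: t.start()
    for t in threads: t.join()
    return counter

print(run(racy_increment))    # may be < 200000: lost updates
print(run(logged_increment))  # always 200000: the "heisenbug" is gone
```

The second version isn't fixed on purpose; the lock is only there for the
"debugging", which is exactly why the bug vanishes under observation.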

Schrödinbug: the code should/could never have worked in the first place, but
did. Once you see the code and realize it shouldn't work, it stops working.

[http://www.catb.org/jargon/html/S/schroedinbug.html](http://www.catb.org/jargon/html/S/schroedinbug.html)

------
pavlov
The "interstellarbug"... At the center of the entire system is an enormous
black hole of collapsed mystery code. It is unimaginably dense with unknown,
unmeasurable bugs and virtual machines whose purpose cannot be deduced from
the outside. Its API horizon must never be crossed.

A working subsystem orbits the black hole and somehow manages to perform its
operations, but it's extremely hazardous to visit even that subsystem: due to
the enormous gravitation of the black hole bugs, every hour you spend working
on the code makes you miss five years of your real life.

~~~
coderjames
Haha, that's great! We have that situation in our main product line where I
work.

At the center of the system is 30 year old Z8000 assembly code that gets run
through a translator during the build process to produce C code which gets
compiled and linked into the (PowerPC) executable.

There aren't that many folks left that know Z8K assembly, so not many people
can begin to wade into that code and fix long-standing bugs. Everyone else
builds wrappers so they don't have to actually get too close to the translated
assembly code. But since they don't quite understand the code they're
avoiding, these wrappers tend to be fragile and prone to odd corner-case bugs
themselves.

------
sillysaurus3
One effective way of dealing with a slow bug (or "heisenbug") is to rewrite
the system. That sounds crazy, and perhaps it is crazy in a company
environment. But when I was first learning to program, I introduced a
heisenbug every couple of weeks or so. After spending 30 minutes trying to
debug one, simply deleting the entire module and rewriting it was unreasonably
effective. The total time investment was about 60 minutes.

I haven't encountered a heisenbug in a long time though. Maybe years. I'd like
to think heisenbugs are inversely proportional to skill, but maybe it's just
luck.

~~~
fest
While that could work for small software systems (or small modules which you
can rewrite), it's simply not a viable solution for hardware bugs (like the
one GP described) or for large software projects that are mostly in a state of
perpetual mess with a horrible dependency graph.

This actually reminds me of a type of bug every electronics novice has seen: a
circuit does not work unless you touch it with your finger (or even place your
finger next to the circuit). Some of these bugs you can solve by rewriting the
software (enable an internal pullup/pulldown resistor), but others you can't
solve without a soldering iron (a missing GND connection).

------
gbrown
Your probability calculations are... non-standard. Although you get the right
answer, it's not correct to treat a discrete probability distribution as if it
were continuous. A more standard approach would be to calculate, for an
appropriate time unit, the single-event probability and apply a geometric
distribution.

Edit: you also seem to mix up the time to an event with the number of events
in an hour. Did you actually use your derivation to get your time estimates,
or did you run a simulation?

------
eterm
The probability density graph annoyed me! If the bug is as likely to be found
in the first minute as it is during the second (given it wasn't found during
the first), then it's actually most likely to be found early, and the density
would look like the third graph. Such a thing doesn't require it being a
startup bug, just a bug driven by randomness, so it is a Poisson process.

A flat density would imply a limit on how long discovery can take, since the
area under the curve cannot be infinite.
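The "equally likely each minute, given it hasn't been found yet" property
eterm is describing is memorylessness. A quick numeric check with an
exponential waiting time (the rate here is arbitrary):

```python
import math

rate = 0.2  # arbitrary discovery rate per minute

def survival(t):
    """P(bug still undiscovered after t minutes) for an exponential waiting time."""
    return math.exp(-rate * t)

# Memorylessness: P(T > s + t | T > s) == P(T > t)
s, t = 5.0, 3.0
conditional = survival(s + t) / survival(s)
print(abs(conditional - survival(t)) < 1e-12)  # True
```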

~~~
kabouseng
Hi, my apologies if it isn't clear.

Another way to look at it is with a dice-roll analogy. Every time you roll a
die, the odds of getting a certain number are 1/6. On the next roll the odds
are the same, regardless of how many times you rolled before. That is what the
first graph is supposed to illustrate.

Getting a certain number within some number of throws is a different question,
and is indeed why you calculate the cumulative distribution function, or
graph. And that is indeed the 4th graph.

Any suggestions on how I can make it clearer or more intuitive, or indeed if I
have made a mistake?

~~~
eterm
Yeah I think I understood the intention. In that case it's a plot of
conditional probability rather than probability density.

I wouldn't change the article, it's clear enough and it's not like they're
accurately plotted graphs, just sketches to get your idea across.

Edit: By the way, there is a distribution similar to a Poisson process but
with a changing rate; it is typically used for failure-rate analysis
(time-before-failure modelling) but could also be used here to describe time
before bug discovery:
[http://en.wikipedia.org/wiki/Weibull_distribution](http://en.wikipedia.org/wiki/Weibull_distribution)
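A sketch of the Weibull hazard rate eterm mentions (parameter values are
arbitrary): shape k < 1 gives a falling hazard (startup-type bugs surface
early), k > 1 a rising one (wear-out), and k = 1 recovers the constant-rate
exponential case.

```python
def weibull_hazard(t, k, lam):
    """Instantaneous failure (bug discovery) rate at time t."""
    return (k / lam) * (t / lam) ** (k - 1)

lam = 10.0  # scale parameter, arbitrary
print(weibull_hazard(1, 0.5, lam) > weibull_hazard(5, 0.5, lam))   # True: falling
print(weibull_hazard(1, 2.0, lam) < weibull_hazard(5, 2.0, lam))   # True: rising
print(weibull_hazard(1, 1.0, lam) == weibull_hazard(5, 1.0, lam))  # True: constant
```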

~~~
gbrown
He's also mixing up discrete and continuous probability distributions. He got
pretty much the right answer, but as a statistician it kind of hurts to look
at.

------
MichaelCrawford
The author mentioned bugs that are more likely to occur sooner rather than
later, such as startup bugs (errors in the boot code or kernel
initialization).

One can also get shutdown bugs. If you like to brag about how long your server
stays up, you'll never see them, but if you shut down your box at the end of
your workday, you're likely to.

I myself isolated a bug in the Classic Mac OS 7.5.2 (or maybe it was 7.5.3)
Open Transport Ethernet shutdown procedure. The way I found the bug was to
write an AppleScript that consisted of "Tell Finder Restart", then place that
in the Startup Items folder.

I very quickly found that the bug occurred on networked computers, but not on
those which were not connected.

For *NIX you could test the kernel's startup and shutdown by having an init
script that did "shutdown -r now".

I don't ever hear about others testing this way.

I didn't invent this; it was used by another department in my same building to
test development builds of System 7. They had a couple hundred Macs, many of
which would do nothing but reboot 24/7.

------
snake_plissken
I'm chasing something like this right now that started very recently. We have
a handful of telematic devices out in the field that will just randomly stop
sending information; no power disconnect events, no losing network
connectivity events, nothing to indicate there are any issues. One day a
device is fine, then the next it isn't. What makes it even more aggravating is
that some of them randomly come back online and shoot over all of the missing
data from when they went dark, but apparently still functioned as intended
during the interim.

At the moment I have no idea how to reproduce the issue but I have a few
theories as to why it might happen: lithium ion backup batteries finally going
bad, cold weather affecting the batteries, or an incompatibility with the
device configuration code and its current firmware revision.

------
ajb
An interesting exercise. Why do you think an increasing probability
distribution makes this equivalent to the halting problem, though?

Without loss of generality, an increasing probability is >= some P after some
time T. Being conservative, we can assume the failure probability is zero
before T and ==P after T. Of course, we don't actually know P or T, but they
must be consistent with us having seen the bug at all. So we can use
uninformative priors to define the posterior distribution of these values
given the number of times we have seen the bug and the amount of time required
in each case. Unless I'm missing something?

I once wrote a probabilistic version of git bisect
([https://github.com/Ealdwulf/bbchop](https://github.com/Ealdwulf/bbchop)) for
use with intermittent bugs, but it hasn't seen real use.

------
ilitirit
A few weeks ago I encountered a very annoying-to-track-down "pseudo"
Heisenbug in a mobile application I'm working on.

Once you logon to the application, it downloads product data (ID, description,
quantity) for sales reps for that day. To cut a long story short, I'd just
finished implementing a new piece of functionality and I logged on to the app
to test it. The application started behaving very unexpectedly. I
double-checked my code, but couldn't see anything that would make it act the
way it did.
Then I tried logging on as a different user, et voila! It worked! Tried the
old user and the problem occurred again. Tried a 3rd user and I got different
behaviour again, but not the same as with the original user I was testing
with. I thought it had to be a data issue, so I uninstalled the application
and cleared the cache and app data, reinstalled, and the bug was gone.
Whatever I tried I could not replicate the problem. I shrugged it off as local
database corruption and continued working.

A few days later, the same thing happened. Same symptoms, but this time the
bug disappeared without me even having to reinstall the app. After a few hours
of frustrating debugging and source code reviews, I remembered something. I
ran into the problem at the same time of the morning as I did before - around
12am. It turns out that the process that populates the server with daily
product information that usually ran at 11pm was rescheduled to run at 12am
(we weren't informed about this change). It ran for 5 to 8 mins. At the time I
logged on, the data for the first user was not available yet, but the data for
the second user was. To complicate things, if the server can't find daily
product information for that day, it sends data from the previous week (the
products are essentially the same, but the quantities may be slightly
different and the IDs are tied to older stock batches). But the user I was
testing with didn't exist 7 days ago so he didn't have data to fall back on.
The 3rd user's daily product data also wasn't available yet, but he did exist
7 days ago so he received data with product IDs that I did not expect.

In the end I "fixed" the problem using a scheduled task that emailed tech
support if the product data was not in our databases by 11:30pm.
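The workaround at the end could be sketched roughly like this (the function
names, the deadline constant, and the alerting hook are all made up for
illustration; a real version would query the database and send email):

```python
import datetime

DEADLINE = datetime.time(23, 30)  # assumed cutoff: 11:30pm

def check_daily_data(now, daily_data_present, alert):
    """If the day's product data hasn't landed by the deadline, raise an alert.

    `daily_data_present` and `alert` stand in for a real DB query and a real
    email to tech support.
    """
    if now.time() >= DEADLINE and not daily_data_present():
        alert("Daily product data missing at %s" % now.isoformat())
        return False
    return True

# Example run with stand-in callbacks:
alerts = []
late = datetime.datetime(2014, 3, 1, 23, 45)
check_daily_data(late, lambda: False, alerts.append)
print(alerts)  # one alert fired
```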

------
julie1
FSMs are analogous to cellular automata, and since the underlying algebra is
non-linear, it is impossible to predict a finite sequence. FSMs do indeed most
of the time have regular basins of attraction with finite sequences, but there
are chaotic evolutions that may result in finite (and longer than usual)
sequences, or in infinite sequences without repetition (given an infinite
playground, or CBP does the job).

And we put a lot of these chaotic systems in our software design.
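A small illustration of the chaotic-evolution claim (Rule 30 here is my
example, not from the comment): an elementary cellular automaton on a ring,
where flipping a single cell in the initial state makes the two evolutions
diverge.

```python
RULE = 30
WIDTH, STEPS = 101, 50

def step(cells):
    """One update of an elementary cellular automaton on a ring."""
    n = len(cells)
    return [
        (RULE >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

def evolve(cells, steps):
    for _ in range(steps):
        cells = step(cells)
    return cells

a = [0] * WIDTH
a[WIDTH // 2] = 1        # single live cell
b = list(a)
b[WIDTH // 2 + 1] = 1    # one-cell perturbation

diff = sum(x != y for x, y in zip(evolve(a, STEPS), evolve(b, STEPS)))
print(diff > 0)  # True: the trajectories have diverged
```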

The definition of a complex system is: a lot of simple systems interacting
with one another. The physics and maths of this type of problem conclude that
it is pretty much a non-linear, non-Euclidean problem.

So far, what we know about these beasts is that they are not predictable, BUT
they are robust around their equilibrium if not perturbed too much.

The strength of the beast is its network of connections, but that is also its
weakness.

If a perturbation happens at a certain point, it can bring down everything
(the butterfly effect). It happens with a frequency and odds we can't predict,
and then the system behaves in a way we can't predict.

So this is basically the foundation of the internet: a non-deterministic
system used the same way a deterministic system is. What could go wrong?

We can't predict it? Then it won't happen anyway, so let's go back to coding.

