

The craziest bug you never heard about from 2008: A Linux threading regression - tc
http://timetobleed.com/digging-out-the-craziest-bug-you-never-heard-about-from-2008-a-linux-threading-regression/

======
JoeAltmaier
Furnace controller product - optimized energy consumption in a steel recycling
furnace. Customer reported the furnace got 'stuck' every couple of hours, the
electrodes quit moving so it continued to burn electricity (megawatts!) but
wasn't melting anything.

Turned out it was the 'power saving' setting on the PC. To save a couple of
watts it slowed down the CPU, and screwed up a furnace that burned megawatts.
Probably wasted more electricity in a day than the 'power save' feature saved
in the whole US for a year.

~~~
mattgreenrocks
One of my first professional projects out of school was a client/server system
using WinForms. I wrote the server, another guy wrote the client.

We worked hard and fleshed out a good design quickly. We iterated a few times,
then released. We soon got a major bug report from the field: the server would
lock itself up after some time! I was baffled; I didn't write much in the way
of threading code. I pored over the entire codebase several times looking for
any possibility of deadlock. Nothing.

Finally, I went out to the customer site when it happened again. Someone had
selected text in the console window the server was running on, which
effectively halted the process. I wrote a small wrapper around the server that
launched it as a service, and that fixed it.

(I probably should have written it as a service originally, but I was fresh
out of school, so I instead took it as a hard knock.)

~~~
csense
I hate it when Windows processes that really should output to the console,
don't.

Users should learn how to deal with console windows.

~~~
mattgreenrocks
Yeah, but the console subsystem is _ancient_. It's much easier to diagnose
problems with the console always up, but I'm happy to settle for a log file.

Also happy to not be writing servers for Windows anymore. :)

------
Uchikoma
Craziest to me is The Bug by Ellen Ullman.

Personal craziest was an Oracle performance problem on Windows NT in the 90s.
Slow as hell. Going to the server, logging in, checking everything: Blazing
fast. Back at the desk, slow as hell. Problem was the GL tube screen blanker
with software rendering :-)

~~~
brazzy
My craziest one turned out to be caused by World War 2.

Bug report from a bank said that a customer's birth date was not accepted when
trying to open an account - they'd tried and found that any date within a
range of about a month in the summer of 1945 was not accepted. This was a
German bank, and the application was written in Java.

I could reproduce the bug and found that the date was rejected at a very low
technical level in the Calendar class (long before any domain validation
happened), just as if you'd entered the 30th of February. Some debugging
sessions later I found that the Calendar class calculates a lot of internal
date and time fields, and the daylight savings time field contained a value of
2 hours, which was rejected by internal sanity checks.

The name of that field led me to a Java API bug report which explained
everything: The Locale for Germany is "centered" on Berlin, and in the summer
of 1945 Berlin and the Soviet-occupied zone of Germany actually did have a 2
hour daylight savings time (which happened to be identical to Moscow time).
Some smartass in the Java development team had "corrected" the sanity check in
Java 1.4 because he believed 1 hour DST to be the maximum - but Berlin is in
fact not the only timezone which had a 2 hour DST at one time or another. The
bug was fixed in Java 1.5.
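
For anyone who wants to see the 2 hour offset for themselves, here is a minimal sketch using the modern java.time API (Java 8+, so not the Calendar code path described above). It assumes the tz database bundled with your JDK records double summer time for Berlin in 1945. The class name and the naiveSanityCheck helper are made up for illustration; the helper only mimics the kind of "DST can't exceed 1 hour" assumption described above and is not the actual JDK code.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.zone.ZoneRules;

// Hypothetical demo class, not part of any library.
class BerlinDst1945 {

    // Returns the amount of daylight saving time in effect in the given
    // zone at the given local date-time, according to the JDK's tz data.
    static Duration dstAt(String zoneId, LocalDateTime local) {
        ZoneId zone = ZoneId.of(zoneId);
        ZoneRules rules = zone.getRules();
        Instant instant = local.atZone(zone).toInstant();
        // Difference between the actual offset and the standard offset.
        return rules.getDaylightSavings(instant);
    }

    // Illustrative only: a check that assumes DST never exceeds one hour.
    static boolean naiveSanityCheck(Duration dst) {
        return dst.compareTo(Duration.ofHours(1)) <= 0;
    }

    public static void main(String[] args) {
        Duration summer1945 = dstAt("Europe/Berlin", LocalDateTime.of(1945, 7, 1, 12, 0));
        Duration summer2013 = dstAt("Europe/Berlin", LocalDateTime.of(2013, 7, 1, 12, 0));

        System.out.println("DST in July 1945: " + summer1945
                + ", passes naive check: " + naiveSanityCheck(summer1945));
        System.out.println("DST in July 2013: " + summer2013
                + ", passes naive check: " + naiveSanityCheck(summer2013));
    }
}
```

If the tz data matches that assumption, the 1945 date should fail the naive check while the 2013 date passes, which is the same shape of failure as a valid birth date being rejected.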

~~~
mutagen
Times and dates are such a minefield for developers. I really came to
appreciate this about 10 years ago when I dove into the Python datetime module
documentation to fix a 'simple' problem and saw how much work went into
getting things right. Combine that with how much of it comes up on RISKS
(comp.risks) and I try not to ever take dates and times for granted.

------
mguillemot
My craziest bug was a Java clock issue. If you had some specific model of
motherboard, calling System.currentTimeMillis() repeatedly could actually
make your system clock run faster. I mean, actually CHANGE the system clock.
For real. Like 10% faster. That led to veeeeery interesting issues related to
timing on the game I was working on, and of course it took me days before I
would even think that my problems could be caused by the time actually running
faster on different machines.

~~~
mzs
Reminds me of this:

[http://fxr.watson.org/fxr/source/i86pc/ml/locore.s?v=OPENSOL...](http://fxr.watson.org/fxr/source/i86pc/ml/locore.s?v=OPENSOLARIS;im=10#L1943)

------
LeonM
Wow, already down, that was a fast one ;)

Cached version
[http://webcache.googleusercontent.com/search?q=cache:http://...](http://webcache.googleusercontent.com/search?q=cache:http://timetobleed.com/digging-out-the-craziest-bug-you-never-heard-about-from-2008-a-linux-threading-regression/)

~~~
sp332
More permanent version:
[http://web.archive.org/web/20131208100429/http://timetobleed...](http://web.archive.org/web/20131208100429/http://timetobleed.com/digging-out-the-craziest-bug-you-never-heard-about-from-2008-a-linux-threading-regression/)

~~~
Toenex
Good to see that further permanence has been added. ;)

------
eloff
LuaJIT on Linux x64 uses MAP_32BIT, so this bug will also affect it. Heaven
help you if you were using LuaJIT and pthreads in the same process.

------
Moral_
Once the site comes back up, the author has a phenomenal analysis of the perf
events local root exploit from over the summer. If you're interested in
exploit development or just security bugs in the kernel his analysis is great.
He takes the time to explain things at a level that even the worst cs student
could understand.

~~~
chronomex
You can get the page itself from Coral Cache:
[http://timetobleed.com.nyud.net:8090/digging-out-the-crazies...](http://timetobleed.com.nyud.net:8090/digging-out-the-craziest-bug-you-never-heard-about-from-2008-a-linux-threading-regression/)

But the site uses a fully-qualified link to its stylesheet and so won't
render.

~~~
acqq
Full text, as I also prefer to read the text on the phone:

[http://pastebin.com/iTErn9fD](http://pastebin.com/iTErn9fD)

