
Ask HN: What was the worst bug you've ever solved ? - jacquesm
What is the worst bug (the software kind) that you've encountered in your hacking career ?
======
mooders
Back when text messaging capability was a rarity on mobile phones, which were
themselves rare, I was testing an SMS-based weather forecast service that I
had written on behalf of one of the mobile network operators.

The testing worked well on the emulator so I decided to test it over the
public network to an actual handset. Only I forgot to advance a recordset
through which I was looping, so the code never hit the end of recordset
condition. It took me some time to notice there was a problem...

The fact that I crippled a national SMS network for a few hours was bad.

The fact that my company had to pay for each SMS, wiping out our profit for
that month was worse.

The fact the handset was mine and on my first date with a girl later that
evening (whom I later married) my handset kept beeping with incoming text
messages (about 96,000 if I remember) was the ultimate.

The handset didn't have a silent, no-vibrate mode (either it beeped or it
vibrated or it did both), and because the SMS inbox filled up after 200 or so
messages, it took days of the inbox filling up, me clearing it message by
message, then it filling up again, ad nauseam.

Still, I laugh about it now...

~~~
ajju
One of my colleagues did this with our automated notification system. After
his phone received about $60 worth of text messages, he panicked and shut down
the server!

Then again $60 is only about 1200 messages.

~~~
yan
> Then again $60 is only about 1200 messages.

You must not be in the US..

~~~
elblanco
Yeah, that's like $10,000 to $20,000 in text messages.

~~~
ajju
What? Are you saying it costs nearly $10 per message or more? How can 1200
messages cost $10K - $20k?

~~~
yan
That's what we call "speaking in hyperbole"

------
gdp
I inherited a giant hideous stock-management system. It did a certain amount
of automated ordering without manual intervention.

Long story short: a nasty race condition meant that it was over-ordering
duplicate products from the suppliers to the tune of tens of thousands of
dollars per day.

On the general theme, my most frightening software experience was when I met a
guy who was the star programmer for a company doing controllers for elevators.
I got talking to him, and he showed me some code. It took me about 15 seconds
to identify an edge case that would engage the motor while the elevator was at
the top floor (thereby attempting to pull the elevator into the roof?). It
took me a further hour to explain what the edge case was. The bug wasn't that
scary, as I'm sure there are hardware failsafes, but the general dimness of
the guy writing software to control lifts was scary.

I started taking the stairs after that.

------
cmos
In high school I got a job for my local Department of Public works in the
Power division. I lived in a small New England town that did their own power,
much like most towns do their own water and sewer.

My job was as an assistant to the inventory guy, a 70 year old feisty man with
one hand named Al.

I was often bored when Al would tell me to 'go hide somewhere' so I wrote some
software to help him manage the inventory system. The power engineers in
charge saw this and after a few small programming assignments had me work on
updating the newly installed SCADA control system. This was a specialized
programming environment that controlled all the power in the town.

We were setting it up to buy power from the local college during the yearly
'peaks' in August, thus reducing our yearly electrical bill by potentially
millions of dollars.

After a month of working on it and incrementally adding my changes I screwed
up. I knew this when I submitted a change and all the alarms went off at the
substation.

Half the town's power was out. I got it back on after an hour, and nobody
called with ventilator issues, so I think there was no real harm done.

The engineer in charge of the department laughed it off when he saw my
apprehension about the situation. He said in the grand scheme of things they
have made far bigger mistakes than that, probably referring to the blown
transformer a couple months earlier.

~~~
imp
That must have been one scary hour.

------
edw519
A batch run of only a few thousand items was running all night long, rarely
finishing and causing all kinds of problems when people logged in in the
morning. The users had been complaining about this for years.

I was given the ticket and found a "SLEEP 10" (causing a 10 second pause for
each item) in the 10 year old BASIC code, put in by the original programmer
for debugging purposes, and never removed before it was promoted.

I removed the "SLEEP 10" and run time went from 12 hours to 23 seconds.

The users loved me, but my boss was not pleased. He said, "You should have
changed it to a SLEEP 5 so we had something else to give them the next time
they complained."

~~~
bendtheblock
Perhaps it was intentional on the original programmer's part... but he never
got round to reducing the sleep time -
<http://thedailywtf.com/Articles/The-Speedup-Loop.aspx>

------
ilitirit
Not so much a bug, but this incident cut my lifespan by about 5 years or so:

I was fixing an account balance on a customer's master database. Any change
that happens on the master gets replicated to 30 branches, usually at 10
minute intervals.

I wrote the UPDATE statement, highlighted it, and pressed "Execute".
Unfortunately, I didn't select the WHERE clause, so I basically gave all their
customers (85000) a 50c credit balance. The IDE I was using also had a bug
that caused it to ignore the auto-commit setting (which was turned off), so it
basically committed the transaction. First I tried a ROLLBACK, which failed
obviously. Realizing I had to act quickly, I disconnected the network cable (I
was working on the DBMS server) to stop replication. I extracted the
transaction log from the current database into a textfile (a few hundred MB)
and restored the database from the most recent backup. Then I basically ran
the extracted transaction log as SQL scripts against the database, hoping that
it wouldn't fall over. I didn't want to do a normal restore because I was
afraid of it showing up in the logs.

Within a very stressful 60 minutes I had everything back to normal. I never
told a soul IRL about it.

~~~
jacquesm
Which is one of my pet gripes with SQL, that the default is to affect all
records on update, that should have been:

UPDATE tablename SET xxx='newvalue' WHERE ALLRECORDS;

Or something to that effect.

~~~
duskwuff
MySQL has --safe-updates (aka --i-am-a-dummy), which will refuse to execute
UPDATE and DELETE statements without a key in the WHERE clause.
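
A similar guard can live in application code. Here is a minimal sketch, using Python's `sqlite3` as a stand-in for whatever DBMS is actually involved; `guarded_update`, `make_demo_db` and the `customers` table are made-up names for illustration. It runs the statement inside a transaction and commits only if the number of affected rows matches what the caller expected.

```python
import sqlite3


def guarded_update(conn, sql, params, expected_rows):
    """Run an UPDATE/DELETE and commit only if it touched the expected row count."""
    cur = conn.execute(sql, params)  # sqlite3 opens a transaction implicitly here
    if cur.rowcount != expected_rows:
        conn.rollback()  # undo the statement instead of committing a surprise
        raise RuntimeError(
            f"expected {expected_rows} row(s), statement touched {cur.rowcount}"
        )
    conn.commit()
    return cur.rowcount


def make_demo_db():
    """Build a small in-memory table (hypothetical schema) to try the guard on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, balance REAL)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(i, 0.0) for i in range(1, 6)]
    )
    conn.commit()
    return conn
```

With this in place, an UPDATE that lost its WHERE clause touches five rows where one was expected, so it is rolled back and raises instead of being committed.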

~~~
jacquesm
Would you feel the same if in unix the default for 'rm' was 'rm -rf *' ?

------
Kirby
For a different definition of worst:

I started a job recently at an ecommerce company. There was a long-standing
bug with the cart display in the upper right of the page always saying that
the cart was empty. People would report it all the time, and the quite smart
lead programmer said it was something really complicated that he hadn't had
time to investigate.

But eventually, after I sort of knew my way around the code, and when I
finished up all the tasks on my to-do list, he handed me that as a why-not-
investigate sort of thing. He didn't really have any idea, just that the
previous long-gone coder had said it was complicated and in the depths of the
way the front end code interacted with the order system.

So, I reproduce it, and look at the template code. These two lines, right next
to each other:

[% cart_summ = ourdb.cart_summary %]
[% IF cart_cumm.qty > 0 %]

Note that the two variables don't match. And this was broken on the site for
_FOUR YEARS_. And nobody looked because someone said he had and it was hard,
and nobody had time for a hard problem.

 _facepalm_

~~~
tetha
hehe. reasons to develop YOUR_LANGUAGE-lint?
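
A toy version of such a lint, as a rough sketch only: it assumes directives of the form `[% name = ... %]` and `[% IF name... %]` and knows nothing else about the template language. `undefined_names` is a made-up helper, not part of any real template toolkit.

```python
import re

# Matches the body of a [% ... %] directive.
DIRECTIVE = re.compile(r"\[%\s*(.*?)\s*%\]")
# "name = ..." assigns a template variable.
ASSIGN = re.compile(r"^([A-Za-z_]\w*)\s*=")
# "IF name..." reads a template variable.
USE = re.compile(r"^IF\s+([A-Za-z_]\w*)")


def undefined_names(template, known=()):
    """Return variables tested in IF directives that were never assigned."""
    defined = set(known)
    problems = []
    for body in DIRECTIVE.findall(template):
        assigned = ASSIGN.match(body)
        if assigned:
            defined.add(assigned.group(1))
            continue
        used = USE.match(body)
        if used and used.group(1) not in defined:
            problems.append(used.group(1))
    return problems
```

Fed the two directives quoted above, it reports `cart_cumm` as never assigned.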

------
nomurrcy
Some users of a (shipped, fairly heavily used) web app we had deployed were
getting kicked back to the login screen at random. Sometimes, very frequently.

Looking in the logs we could see that these users were somehow losing their
authentication cookie and the application was correctly bouncing them to
login. So how were they losing their cookie? Assuming it was a bug in the code
we searched and searched to no avail.

Finally I discovered that the hardware load-balancer our CTO/'IT guy' had
insisted on was the culprit. The load balancer would buffer fragmented
requests and re-assemble them before sending them on to the server.
Unfortunately the load balancer had a huge bug in its firmware.

If a user was using firefox, on windows, and their request was fragmented such
that the first packet contained data up to the end of a header line including
the \r but not the \n, so the next packet would start with a \n and then a
header name, the load balancer would insert a \n\r between the two packets,
thus effectively truncating the HTTP headers, usually before the cookie lines.

When I found this bug I couldn't believe that this was actually happening, I
thought I was taking crazy pills, but you could run a sniff on the front and
back side of the load balancer and see the request go in well formed and come
out busted. We ditched the hardware load balancer and all was well.
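
To see why a couple of stray bytes are enough to lose the cookie, here is a small model of the failure. It is a sketch under assumptions: `buggy_reassemble` imitates the behaviour described above (it is not the vendor's code), and `parse_headers` is a deliberately naive parser that treats the first blank line as the end of the headers, which is what HTTP servers do.

```python
def parse_headers(raw: bytes) -> dict:
    """Parse header lines up to the first blank line (CRLF CRLF)."""
    head, _, _ = raw.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the request line
        name, _, value = line.partition(b":")
        headers[name.strip().lower().decode()] = value.strip().decode()
    return headers


def buggy_reassemble(fragments):
    """Join fragments, inserting a stray LF CR when a CRLF straddles a boundary."""
    out = fragments[0]
    for prev, nxt in zip(fragments, fragments[1:]):
        if prev.endswith(b"\r") and nxt.startswith(b"\n"):
            out += b"\n\r"  # the described bug: CR + LF CR + LF forms a blank line
        out += nxt
    return out
```

Splitting a request right after the `\r` that ends the Host line and passing the two pieces through `buggy_reassemble` produces `Host: ...\r\n\r\nCookie: ...`, so the parser stops before it ever reaches the Cookie header.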

------
tjr
Was converting an avionics subsystem from Ada to C. It was a client
application that had to talk to an Ada server, sending and receiving rather
huge chunks of data, large, deeply nested, intricate structure types. The C
structure type had to match the Ada type exactly, or else it wouldn't work.

I got it working fine on our desktop simulation, but running on the actual
hardware it was consistently off. After extensive testing, I realized that it
was a bug in the compiler for the target hardware, such that a very particular
type of structure (something like, {int, char, float}) was being packed
incorrectly, resulting in a 2-byte pad that shouldn't be there. If I reordered
the structure elements, it was fine, but that particular grouping and order
refused to work correctly.

It was GCC, so we could fix the compiler ourselves, right? Not really, as, for
avionics systems the compiler has to be thoroughly qualified for avionics use,
and changes equal requalification. I "fixed" it by storing the float as an
array of characters, converting it to and from a real float type as we needed
to use the data value.

Trivial, perhaps, but I was very excited to resolve the problem, after
spending days barking up wrong trees. One usually expects that the problem is
not in the compiler... :-)

~~~
vicaya
Were you using GCC's __attribute__((__packed__))?

Anyway, the standard way (to handle protocols) is to parse the thing not
making any assumption of the struct layout.

~~~
tjr
Yes. The other structures all packed correctly.

If I am understanding what you are saying, we really couldn't do that, as the
server sent and expected to receive binary blobs of data; the only way to know
what was what was to have a map of where the data elements were.

~~~
joe_bleau
I like to use macros (in C) to spit out the structure size and offsets of each
structure member over a serial port, in a format I can then cut-n-paste back
into the source. This output consists of a bunch of (compile-time, when
possible) assertions so that any changes to a structure break the build. These
assertions go on both the embedded side and the PC server side, so any weird
packing issues show up at compile time.
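
The same idea can be sketched outside C. The example below uses Python's `struct` module and a hypothetical packed {int, char, float} record like the one mentioned upthread; the field names and the expected offsets are assumptions chosen for illustration. The layout is computed once and asserted at import time, so a changed field breaks loudly rather than silently shifting data.

```python
import struct

# Hypothetical wire layout: int32, one-byte char, float32, with no padding.
# "<" selects little-endian standard sizes and disables alignment padding.
WIRE_FORMAT = "<icf"
FIELDS = [("id", "i"), ("tag", "c"), ("value", "f")]


def field_offsets(fields, byte_order="<"):
    """Return ({field: offset}, total_size) for an unpadded layout."""
    offsets = {}
    prefix = byte_order
    for name, code in fields:
        offsets[name] = struct.calcsize(prefix)  # size of everything before it
        prefix += code
    return offsets, struct.calcsize(prefix)


OFFSETS, SIZE = field_offsets(FIELDS)

# Layout assertions: these fail as soon as someone edits FIELDS.
assert OFFSETS == {"id": 0, "tag": 4, "value": 5}
assert SIZE == 9
```

Both ends of the link can carry the same table of expected offsets, so a mismatch shows up before any data is exchanged.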

------
f00
Bug with the most spectacular results:

As a (former) hardware engineer, I've worked on many projects where bugs have
physical effects. This can range anywhere from amusing to seriously dangerous.

One such bug involved a mistake in the assembly diagram and silkscreen for a
circuit board. The result was that a tantalum capacitor was installed
backwards on a 12V supply rail.

Tantalum capacitors are polarized, and they fail in a spectacular way when
reverse-biased. In this case, the supply rail could source upwards of 20A, so
the fireworks were loud and impressive. Luckily the cap was easily replaced
and the only permanent damage was cosmetic.

Hardest-to-troubleshoot bug:

In my subsequent return to the world of software, I worked on device drivers
for network interfaces (among other things).

NICs frequently operate through a circularly-linked list of packet
descriptors, which contain pointers to buffers in RAM where the NIC can DMA
packet data. The hardware fills the DMA buffers and marks the descriptor as
"used," and the driver chases the NIC around the ring, processing the packet
data and marking the descriptors as free.

In testing, we discovered that under long periods (hours, usually) of heavy
load, the system would occasionally freak out and stop processing packets.
Sometime later, various software modules would crash.

Working backwards through the post-mortem data, I saw that the NIC would get
"lost" and dump packet data all over system memory. I dumped the descriptor
ring (tens-of-thousands of entries) and wrote some scripts to check it for
consistency.

To make a very long story short, when the NIC was stormed with lots of 64-byte
packets with no gaps, it would eventually screw up a DMA transfer and corrupt
the "next" pointer in the descriptor ring. On the subsequent trip through the
ring, the NIC would chase an errant pointer off into system memory and corrupt
other system data structures.

Since hardware can DMA anywhere in RAM, the OS is powerless to stop it. The
resulting errors can be ridiculously hard to track down and fix.
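
A consistency check over a dumped ring might look roughly like this. It is only a sketch: real descriptors hold physical addresses, whereas here each entry is reduced to the index of the slot its "next" pointer refers to, and both function names are made up.

```python
def check_ring(next_pointers):
    """Return indices whose 'next' pointer is not the following slot (with wrap)."""
    n = len(next_pointers)
    return [i for i, nxt in enumerate(next_pointers) if nxt != (i + 1) % n]


def ring_is_closed(next_pointers, start=0):
    """Follow 'next' pointers and confirm they visit every slot and return to start."""
    seen = set()
    i = start
    while i not in seen:
        if not 0 <= i < len(next_pointers):
            return False  # the pointer left the ring entirely
        seen.add(i)
        i = next_pointers[i]
    return i == start and len(seen) == len(next_pointers)
```

On a dump with tens of thousands of entries, `check_ring` narrows the search to the handful of slots whose pointer was clobbered.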

------
tedshroyer
Had an obscure picking id wrap around because a table wasn't getting cleared
for debugging purposes which resulted in excessive amounts of beer being
delivered to unsuspecting customers at an automated gas station.

Here's a video of part of the result:
<http://www.youtube.com/watch?v=RUhLDtPnSuQ>

~~~
ajju
A bug that gives out free beer. Talk about a bug with a silver lining :)

------
tlb
Runaway robots at Anybots have caused:

    
    
      - 2 holes in drywall
      - 1 bent bookshelf
      - 1 dent in concrete floor
      - 1 frightened Jessica
      - http://www.youtube.com/watch?v=qkenIInV9rI
    

The last one was fun because I have logs showing packets from the PC/104
computer stack (running FreeBSD) connected to the robot while it was in
midair.

~~~
jacquesm
How do you dent a concrete floor with a robot ?

~~~
tlb
Monty weighs 160 lbs, and at the time had an over-designed metal piece for its
neck. A sensor failed and it faceplanted, taking a 1 cm deep chunk out of the
concrete. It's still there, under some carpet.

~~~
jacquesm
Sure doesn't look that heavy, maybe I'm misjudging the scale here, I'll have
another look at that video.

Nice balancing by the way, smooth dampening.

Edit: Ah yes, now I see it, that's a small scope and a lab powersupply in the
background, I estimated the size off by a lot.

I figured the whole thing was about a foot tall or so, sorry!

------
__david__
The most memorable bugs are the ones that cause physical damage. This was
mine:

<http://www.youtube.com/watch?v=b7i2KkYYulI>

Damage: Blown tire, dented rim, looking like fools in front of our peers.

~~~
plaes
Um.. what was that? Some autonomous vehicle navigation system?

~~~
jacquesm
Darpa Urban Challenge, top right of the video.

~~~
thorax
What was the bug?

~~~
__david__
Depends on your point of view. :-) Either:

(a) The CAN bus (which reports wheel velocity and position of the steering
wheel among other things) micro-controller hardware stopped sending interrupts
which caused the main computer to think we were stopped. As far as I can tell
this is just a straight up hardware bug with the Philips ARM chip we were
using. This caused the accelerator controller to floor it because it was only
seeing the last CAN message we ever got which happened to be zero velocity.
Same thing with the steering (hence the big swerve).

or

(b) I failed to consider the contingency of not getting any CAN interrupts
(either because of the [very intermittent] hardware bug or because the
connector got disconnected--turns out the symptoms are the same) and didn't
have any code written to deal with it. Guess what I wrote that night. :-)
Luckily I had all day to think about it while a new tire was fetched and by
the time I figured out what was going on it took about 10 minutes to code the
fix: a watchdog timer that shuts the world down if there are no inputs from
the sensors for any reason.

It seems obvious in retrospect but when things are working you sometimes
forget about weird failure cases.
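
The shape of such a watchdog, as a minimal sketch rather than the code that went on the vehicle: the class, the timeout value, and `throttle_command` are all assumptions for illustration. Time is passed in explicitly so the logic can be exercised without a real clock.

```python
class Watchdog:
    """Tracks when sensor input was last seen and reports staleness."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = None  # no input received yet

    def feed(self, now):
        """Call whenever a sensor message (e.g. a CAN frame) arrives."""
        self.last_seen = now

    def expired(self, now):
        """True if no input has ever arrived or the last one is too old."""
        return self.last_seen is None or (now - self.last_seen) > self.timeout_s


def throttle_command(requested, watchdog, now):
    """Pass the requested throttle through only while sensor data is fresh."""
    return 0.0 if watchdog.expired(now) else requested
```

The point is that a stale "zero velocity" reading is treated as missing data, not as a measurement.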

~~~
jacquesm
Any chance of getting 'second opinion' style sensors in there that can provide
you with sanity checks ? Such as 'GPS reports movement, but wheel sensors do
not, we have a problem ?'

That way you can avoid a paralysis of the control software until the vehicle
has really come to a halt.

~~~
weaksauce
I read somewhere about realtime applications that do something like this but
redundant sensors holding a consensus polling algorithm. Have three sensors
reporting the same thing and if they are not all in agreement within some kind
of delta then go into some kind of limp mode or have the two sensors in
agreement be the ones that the system uses for its algorithms. I cannot
recall where I read it though.
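
Roughly, the voting scheme described would look like the sketch below. The tolerance, the status names, and the choice to average the agreeing readings are my assumptions, not details from whatever article that was.

```python
def vote(readings, tolerance):
    """Combine three redundant sensor readings into (value, status)."""
    a, b, c = readings
    pairs = [(a, b), (a, c), (b, c)]
    agreeing = [p for p in pairs if abs(p[0] - p[1]) <= tolerance]
    if len(agreeing) == 3:
        return sum(readings) / 3, "ok"  # all three sensors agree
    if agreeing:
        x, y = agreeing[0]
        return (x + y) / 2, "degraded"  # trust an agreeing pair, outvote the third
    return None, "limp"  # no two sensors agree: fall back to limp mode
```

With three sensors a single failure can be outvoted; with only two you can detect a disagreement but not tell which one is wrong.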

~~~
Splines
Me too. The back of my brain is telling me it was for some sort of plane
control software? Maybe? An interesting tidbit that I recall was that they
used different manufacturers to hedge their bets against bugs.

~~~
weaksauce
That sounds about right. I seem to recall that it was for aerospace too. Maybe
NASA? Something about zero defect software. I cannot for the life of me find
the article right now though. I also vaguely remember that it was a HN
submission too.

------
shelfoo
Two immediately jump to mind. One that had a massively bad impact to the
company, another that might have..

First, using perl a (later-fired) co-worker added a hardcoded check like the
following:

if ($client_id = "specific_id") { #email reports }

Needless to say, we emailed reports for all of our clients to a specific
client, didn't go over too well considering that many of them were
competitors.. It was particularly bad because he had previously been talked to
about flipping the constants to avoid the = vs == bug.

Second, possibly abused but not known for sure, was found a few _years_ after
initially being put out. Our webapp created a session ID for each user, MD5
hash.

Except it started like:

StringBuffer md5HashedBuffer = new StringBuffer(userId);

Which, because the userId was an int, simply creates a string buffer of size
userId, not a string buffer initially populated with userId.

The rest of the hash was added afterwards, then the one-time created, with the
result that everybody's session id was the same. Changing your user id in the
GET or POST would allow you to be logged in as a different user.

~~~
mleonhard
Why weren't both of these bugs caught during code reviews?

------
bkz
After having launched our product I was spending some time reviewing commits
together with the senior tech lead. Still to this day I can recall the commit
number, the filename and write down the code from memory responsible for what
turned out to be the source of a bug completely wiping out our users'
computers. Someone had mixed uncommenting a piece of code together with fixing
a bug which hid the fact that some horrible code was active in the product. It
took us 5 minutes to produce a fix and push it out to the update servers. Did
we end up wiping someone's computer? Yup, about a dozen known cases including a
couple in-house. I don't even want to think about how many actual cases
there were, considering that we had about 2 million downloads of our product
before the bug was fixed.

------
Mark_B
A while back, I developed a program to generate invoices for about a dozen
busy warehouses. During testing, for convenience sake, I hard-coded in my
local printer.

Unfortunately, I forgot to return the printer name to a variable when
promoting into production. Hilarity ensued.

~~~
jacquesm
Hehe, that one had me laughing here. Ouch. Hope you put enough paper in it ;)

~~~
newsdog
Better one.

Guy I knew - awesomely good - hex edited a DOS boot sector on an in-house
machine to use FUCK.SYS instead of whatever.sys it normally is (I forget). He
renamed that file to fuck.sys, rebooted and the machine ran. Cool!

We laughed and reinstalled DOS and two days later the boss comes charging in
yelling 'I have a client on the phone who says his new machine can't find
FUCK.SYS!'

The awesome guy goes 'uh oh'. I laughed.

~~~
jownz
config.sys

------
masterponomo
The worst bug I encountered was due to IBM MVS (or COBOL--I was never sure
which was at fault) losing addressability of part of a variable length record.
Now you see it, now you don't. The solution at our shop was to move the whole
record to itself before attempting to look into the record. I was a newbie. If
the old guard hadn't told me that workaround, I NEVER would have thought of
it. This problem eventually went away, but 25 years later we still
occasionally ask each other "did you try moving it to itself" when dealing
with new problems. We chuckle, while today's newbies shake their heads at our
Old Fart humor.

The worst one I ever caused was when Visa started carrying two amount fields
in their credit card records. One was the amount in original currency, the
other was converted to the receiving system's local currency. I used the
original amount. Our hand-made test data used the same currency for both
amounts, so no problem in test. Imagine my surprise when we went live and our
system started posting original currency amounts to cardholder accounts, which
at the time only supported US dollars. Luckily, we caught it early and senior
management and cardholders were all good sports about it. I think those credit
card statements with massive amounts became collector's items.

~~~
dandrews
Since you mention COBOL records were variable-length I'd guess that they
probably contained ODOs (Occurs-Depending-On, variable length arrays for those
of you who aren't COBOL literate). In order for a group or record MOVE to work
properly you had to move the subordinate ODO values first, otherwise the
runtime system would miscalculate the target record length, possibly
truncating the MOVE.

This also meant you couldn't use READ INTO for variable length records (which
is equivalent to a READ followed by a MOVE) without taking some care.

As you say: "newbies shake their heads..."

------
humbledrone
I was working on a C++ daemon process that communicated over a TCP socket. At
the time, we were using the Poco library's facilities to do the standard
daemon startup stuff (get rid of the controlling pty, point standard fds to
/dev/null, etc). Anyway, one of our field installations wasn't working, so I
took a look. It turned out that the communications over the TCP socket weren't
working -- where the client process expected a few header bytes containing the
message length, it was getting wacky values. I tried a bunch of stuff, and in
the end, I displayed the header as ASCII, and it showed up as "SQL: INS". This
blew me away; this looked like some debugging output that normally goes to
standard output when the process wasn't running in daemon mode.

As it turns out, the Poco library didn't read Stevens' UNIX book all that
closely, and they _closed_ all of the file descriptors when turning a process
into a daemon, instead of reopening them to point to /dev/null. So, standard
output was closed, and its file descriptor was reused for the TCP socket. Of
course, things like "cout" always assume that standard output is at a
particular descriptor, so all the standard output from the program was getting
written to the TCP socket.
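
The descriptor reuse is easy to reproduce on a POSIX system, where opening a file returns the lowest unused descriptor number. The sketch below uses a scratch pipe in place of standard output so it is safe to run; both function names are made up, and the second one shows the conventional alternative of pointing the descriptor at /dev/null so that its number stays occupied.

```python
import os


def reused_descriptor_demo():
    """Close a descriptor, open something else, and report whether the number was reused."""
    r, w = os.pipe()
    os.close(r)  # stands in for the daemon code closing stdout
    new_fd = os.open(os.devnull, os.O_RDONLY)  # stands in for the new TCP socket
    reused = new_fd == r
    os.close(new_fd)
    os.close(w)
    return reused


def point_fd_at_devnull(fd):
    """Redirect fd to the null device instead of closing it."""
    devnull = os.open(os.devnull, os.O_RDWR)
    try:
        os.dup2(devnull, fd)  # fd stays open, writes are discarded
    finally:
        if devnull != fd:
            os.close(devnull)
```

Once descriptor 1 is kept busy this way, a later socket can never land on it, and stray debug output goes nowhere instead of onto the wire.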

Boy, that was confusing.

------
joezydeco
I worked on a piece of arcade equipment (manufacturer and model shall remain
nameless) that used a bunch of solenoids to control the works under glass.

A little race condition in the code allowed one of the smaller solenoids to
stay in a duty-cycled state, effectively turning the coil into a small space
heater. Given the right play conditions and length of play, the coil could
catch fire, and a couple of times it did. Lots of wood and plastic under glass
made for a fun little display.

I heard one story about a unit in Paris being dragged out of a cafe and into
the street, then put out with axes and buckets of water. Wish I had been there
to see that.

------
btilly
How do you define worst?

How about most widespread? Once while trying to debug a CPAN module I figured
out that if $condition was false then Perl had a bug causing

my $foo = $bar if $condition;

to leave $foo with whatever value it had on the _previous_ function call. (The
exact behavior is more complex than that, but that's a good first
approximation.) I then made the mistake of reporting this in a way that made
it clear that

my $foo if 0;

was a static variable. Cue years of people like me trying to get the bug fixed
and other people trying to keep it around. In the meantime _EVERY_ large Perl
codebase that I've looked in has had this idiom somewhere and it has caused
hard to notice and reproduce bugs.

How about worst damage to a system? Due to a typo I once caused my employer to
send the Bloomberg's ftp system every large file that it had ever sent. Since
it sent a large file every day, this crashed their ftp server, meaning that a
number of feed providers for Bloomberg didn't have their feeds update that
day. I implemented no less than 3 fixes that day, any one of which would keep
the same mistake from causing a problem in the future.

How about most (initially) bizarre? Debugging a webpage that a co-worker
produced where, depending on the browser window size, you couldn't type into
the form. The bug turned out to be a div that had been made invisible but left
on top of the page. At the wrong window size it was in front of the form, so
you couldn't click on the form elements behind it. (I figured this out by
first recreating it with static HTML, then using binary search to figure out
what parts of the page I could whack and still reproduce it until I had a
minimal example. Then it was trivial to solve.)
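
That whittling-down process can be automated. A minimal sketch, assuming the page can be cut into a list of independent pieces and that `still_fails` (a caller-supplied check, hypothetical here) reports whether a candidate still shows the bug: it tries removing progressively smaller chunks and keeps any removal that leaves the bug reproducible.

```python
def minimize(items, still_fails):
    """Greedily drop chunks of items while the failure still reproduces."""
    items = list(items)
    chunk = len(items) // 2
    while chunk >= 1:
        i = 0
        while i < len(items):
            candidate = items[:i] + items[i + chunk:]
            if still_fails(candidate):
                items = candidate  # the chunk was irrelevant; keep it removed
            else:
                i += chunk  # the chunk matters; move past it
        chunk //= 2
    return items
```

This is a simple greedy reduction, not a guaranteed global minimum, but it usually gets a page down to the few elements that interact.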

~~~
jacquesm
That second one reminds me of this: An ISP called planet internet changed
their homepage, only to find that they reliably crashed Explorer (3 at the
time).

Took a while before the phone rang if I wanted to have a look.

It turned out they had a little animated gif in there with the inter-frame
interval set to 0, causing a divide by 0 in Explorer.

That gif was pretty much the last suspect on the list.

Divide & conquer until you are simply staring at the solution and still you
don't see it...

------
synnik
The store locator function on a national pizza chain's web site would
completely hang their web server whenever an international search was done.
Many, many hours and days of testing and debugging led us to conclude, and
build a proof that it was a reproducible bug within IBM's Domino platform,
only on AIX boxes, only when: 1) A script using LotusScript, their proprietary
language was kicked off, and... 2) A Java agent was then kicked off before the
original script completed.

At the time, Java was a new feature within that platform, so there weren't
many apps that mixed both languages.

After getting to this point, IBM joined in the fix effort, and we had daily
conference calls, on which we always had IBM execs lurking because their 6.0
release of the platform was imminent, and this bug had the potential to wreak
major havoc if not fixed before launch.

So I cannot personally claim to have done the actual bugfix - the IBM
programmer did that. But it was a great learning experience to work together
with IBM to find and fix it.

~~~
jacquesm
Dominos Pizza :)

------
JeffJenkins
Non-strict comparator with the STL. Nearly impossible to identify when it's
happening. It's happened once to me and once or twice to a coworker over the
last couple years, and takes 3-5 days to debug every time.

Example:

    
    
      struct FooSort {
        bool operator()(Foo const& a, Foo const& b)  const {
      -    return not a < b ;
      +    return b < a;
         }
      };
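
The reason the first version is wrong: STL sorting and ordered containers require a strict weak ordering, so the comparator must return false for equal elements, and `!(a < b)` returns true for them. A cheap property check catches this class of bug; the sketch below is written in Python for brevity and checks only irreflexivity and asymmetry over sample values, not transitivity. The function name is made up.

```python
from itertools import product


def strict_weak_ordering_violations(comp, samples):
    """Return the ways comp breaks irreflexivity or asymmetry on the samples."""
    problems = []
    for a in samples:
        if comp(a, a):  # an element must never compare "less than" itself
            problems.append(("irreflexivity", a))
    for a, b in product(samples, repeat=2):
        if comp(a, b) and comp(b, a):  # both orders cannot hold at once
            problems.append(("asymmetry", a, b))
    return problems
```

Running a check like this in a unit test over a handful of values, including duplicates, is far cheaper than the 3-5 days of debugging.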

~~~
hazzen
Try a worse version of this - comparing the results of a floating-point
function call where the left operand gets moved into the 80bit fp unit during
computation, and the right operand stays in register.

Obviously there is a precision difference between the two numbers, enough to
make a < b and b < a return true in a surprising number of cases. The way I
ended up fixing it was by putting the result of the function call in a member
variable in every struct, pre-computing all of the results, and comparing
based on that value.

~~~
dkarl
I hit this, too. Regression tests were failing when I changed code that
obviously shouldn't change the output of the program at all. This happened on
a regular basis when we changed numeric code, because of the normal
limitations of floating-point arithmetic; we just made sure the numerical
results were accurate and updated the regression tests. (The regression tests
were quite handy for finding logic errors; they weren't really used to test
numerical accuracy.)

But in this case I was just adding some error checks, which weren't even being
triggered. Clearly this shouldn't affect the results of our numerical
calculations. Since my code shouldn't affect the calculations, I was convinced
that our existing numerical code had a subtle memory or timing bug. (I knew
that floating-point code was tricky, but clearly I was doing _exactly the same
operations_ on _exactly the same values_.) I spent days staring at code, and
then my boss told me to stop working on it since the results were clearly
correct in both versions, even if they weren't identical.

A few weeks later I read about how values change when they're copied out of
the x87 stack into registers. And I thought, naw, we couldn't possibly be
using x87 arithmetic. But we were. Which was horrifying, since floating-point
calculations could be a bottleneck under some workloads. But we had been
running that way since before I started working on it, so at least it wasn't
my fault. I added a compiler option to request sse2 floating point instead of
x87 floating point. Voila, predictable floating-point results, plus measurably
faster performance on a few tests.

------
sgoraya
This bug was solved _after_ the title went gold and shipped in US :/ We did
not find out about it until a few users emailed us. The bug got through SONY
and our internal testing...

-It was a first gen PS2 game that had to get pushed out since it was considered a launch title. The dungeons in the game were seeded and randomly generated - In some instances, a lever to proceed to the next dungeon was made inaccessible due to the random seed that enclosed the lever with 'wall' tiles - the player either had to start the game over and hope the dungeon was seeded correctly, OR play through the game without finishing that particular dungeon. It was not total killer in the sense that the game could still be completed, but it was tough to know that such a glaring bug got through...

-Fix was made for the Korean version :) Basically a check to make sure that wall polys did not enclose any levers prior to generating the dungeon, if the lever was bounded, re-seed, generate and check again...
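
A generate-check-reseed loop of that kind can be sketched as follows. Everything specific here is an assumption for illustration: the grid representation, the wall probability, the lever sitting in the far corner, and the flood fill used as the reachability check (the shipped fix checked wall polygons, which this does not model).

```python
import random
from collections import deque


def generate(seed, size=8, wall_chance=0.3):
    """Build a size x size grid; True marks a wall tile."""
    rng = random.Random(seed)
    grid = [[rng.random() < wall_chance for _ in range(size)] for _ in range(size)]
    grid[0][0] = False  # player start
    grid[size - 1][size - 1] = False  # lever tile
    return grid


def lever_reachable(grid):
    """Breadth-first flood fill from the start to the lever in the far corner."""
    size = len(grid)
    goal = (size - 1, size - 1)
    seen = {(0, 0)}
    queue = deque([(0, 0)])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < size and 0 <= nc < size
                    and not grid[nr][nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


def generate_playable(seed, max_attempts=1000):
    """Re-seed until the lever is reachable; return (seed_used, grid)."""
    for attempt in range(max_attempts):
        grid = generate(seed + attempt)
        if lever_reachable(grid):
            return seed + attempt, grid
    raise RuntimeError("no playable dungeon found")
```

The attempt cap is there so a bad parameter choice fails loudly instead of looping forever.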

------
yummyfajitas
The nastiest was an interaction between two libraries. If you simultaneously
imported xml.dom and matplotlib (both python libraries), and then called
functions in one of the libraries (I think matplotlib), the program would
segfault.

My incomplete set of unit tests didn't catch it because the two libraries were
imported by separate submodules of mine.

I only managed to find it by writing a unit test, going back in version
control, searching for the first revision where the problem occurred, and
going line by line through the changeset (luckily I commit early and often).

------
keeptrying
While I was working at Lucent, they had a built-in ftp client which was used
to transfer images to the router from a server.

Right after I checked in my code (a completely unrelated feature), this ftp
facility broke such that you couldn't ftp an image to any router on any
platform. So I was assigned this bug which basically stalled the release on
every platform that we supported. (The CTO himself called me.) Now you have to
realise this BigCo supplies every cell phone carrier in the nation - Sprint,
Verizon etc.

What I found was that the ftp system had a bug such that if the image being
ftp-ed was an exact multiple of 8k (or something like that) then it would
fail. My checkin made the image file an exact multiple of 8K. (Story of my
life!)

I found the bug, emailed the CTO and he assigned someone from the core team to
fix it. That guy calls me up and in the end I fixed it myself using vi and
kibitz.

------
kls
Nameless BigCo came in for PCI compliance. On Struts 1.something, they coded
the credit card number as a member variable of a Struts action (Struts 1 used
a singleton pattern), so the last person to submit got to pay for all of the
simultaneous transactions going on in the system. Fortunately my team caught
it in acceptance review.
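
The failure mode is easy to reproduce in any language. A minimal Python sketch, with a hypothetical `SharedAction` class and made-up card labels standing in for the real Struts action: one shared instance holds per-request data, so two overlapping requests both see whichever value was stored last.

```python
class SharedAction:
    """One instance serves every request, like a singleton action."""

    def __init__(self):
        self.card_number = None       # BUG: per-request data on a shared object

    def begin(self, card_number):
        self.card_number = card_number

    def charge(self):
        return self.card_number       # whoever called begin() last


def charge_safely(card_number):
    """Per-request state stays in arguments and locals instead."""
    return card_number


action = SharedAction()
action.begin("alice-card")    # request A starts
action.begin("bob-card")      # request B starts before A finishes
charged_a = action.charge()   # request A is billed to Bob's card
charged_b = action.charge()   # request B is billed to Bob's card too
```

Both charges land on `"bob-card"`, which is the "last submitted person pays" behaviour described above.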

------
rm-rf
Worst software bugs, in increasing order of severity:

An ASP.NET app that didn't consume its record sets before closing its
database connections, thereby breaking IIS's connection pooling and causing
hundreds of SQL server login/logouts per second. Lots of interesting side effects
on that bug.

An obscure page latching problem on a high volume SQL server (hundreds of
transactions & thousands of queries per second). The SQL server would spend
all of its time waiting on page latches, and response time would grow
exponentially. An eleven hour phone call with MS support finally identified
the problem.

Oracle 10.2.0.[1234]

~~~
gaius
_You_ are responsible for Oracle 10?

We need to have a word.

~~~
tlrobinson
I don't know anything about Oracle 10, can you elaborate?

------
WesleyJohnson
I'm still a little new to the game, so I've only had one hacking job thus far.
It spanned 4 years and taught me a lot about what to do and what not to do. I
could type a small book about all the crazy things they did in that company
and the nightmare of spaghetti code, the needlessly complex 3000+ table
database, but I'll try to focus on just a few things that went wrong over the
years or that needed fixing.

Two of the previous employees got into a debate on whether or not you could
directly connect to the database through JavaScript and fetch recordsets, etc.
To prove that you could, one of the developers did just that...and left it in
place...in production...with the full connection string and user credentials
in plain sight in the javascript code on the page.

Countless sql injection holes were plugged over the years, but not before we
got hit with an attack that plugged javascript ad code into 50 or more tables
and a few hundred thousand rows.

The initial developers used ".inc" as the file extensions for include files
like headers, footers and database access. These files sometimes had html,
javascript, asp or all 3. You never knew. The problem was IIS treated these
like text files and would gleefully serve them up if you accessed them directly -
revealing server side sourcecode, more connection strings, etc, etc.

Our accounting "system" was woefully inadequate. Our sales people always said
"yes" and we eventually ran into some clients that wanted their invoices
formatted and calculated in a way our system couldn't handle. We were always
far too busy to take the time needed to properly engineer the new solution, so
this one client's bill was done manually - through SQL Management Studio -
every month, for close to two years before I left. Reconciliation of the bill
was also done manually. The kicker is, billing was done based on dates that
assignments were completed. These dates were not only not locked down once
billing had begun, but new assignments could be injected into the billing
period long after the invoice was generated, because of multiple factors.
Needless to say, it took me half a day, once a month to reconcile their
invoices and on several occasions, due to the very manual nature, payments
were applied to the wrong assignments, transactions, etc. Once to the tune of
$200,000. That was fun to fix. :)

I could go on and on and on and on....

------
MrMatt
I worked on a taxi booking and dispatch system, written in c, and running on
dos with custom networking via RS232. This system was installed at around 300
locations around the UK, and on one fateful day, every installation crashed.

It came down to me to find and fix the problem, and it was subtle. The clue
lay in the fact that all of the sites that crashed did so within about a
minute of one another.

Turns out that some of the old, old sections of the software had been written
by the MD, who, despite referring to himself as 'the emperor of c', was in
fact an atrocious programmer.

The actual trigger was the comms system looking at a byte that determined as
to whether a message had been received. This byte was set to the character 'A'
if a message was received. It just so happened that the first byte of the
current value of the number of seconds since 1970 evaluated to 'A', and had
been written into that memory location by a negative index into an array that
hadn't been initialised.

This negative index into an array that _shouldn't_ have been empty caused a
section of memory to be overwritten that made the comms system think that it
had received a packet. This snowballed quickly, and took down the system
within about five seconds of boot.

Took the best part of two days to track down, and, of course, it was everyone
else's fault but the emperor's.

~~~
duskwuff
Let me guess: The crash occurred slightly before 6 PM on July 22, 2004?
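
That guess can be checked with a little arithmetic, assuming a 32-bit seconds counter and that the byte in question is the most significant one. The counter's top byte first becomes 0x41 (ASCII 'A') at 0x41000000 seconds after the epoch:

```python
from datetime import datetime, timezone

first_a = 0x41 << 24          # 0x41 is ASCII 'A'; top byte of a 32-bit count
moment = datetime.fromtimestamp(first_a, tz=timezone.utc)
print(first_a, moment.isoformat())
# 1090519040 2004-07-22T17:57:20+00:00
```

That is 17:57:20 UTC on July 22, 2004, which fits "slightly before 6 PM".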

------
tlb
At Viaweb, my careless use of $_ in Perl led to the name of every credit card
shown in the merchant UI being replaced by a secret auth key. I didn't see it
because that particular auth check was bypassed for admin users like me. It
wasn't a serious security hole, and we changed the auth key afterwards. But as
it happened, I was in a cranky mood when I created the auth key and it had
some bad words in it, so it was a little embarrassing.

------
dkarl
I worked on a program that ran large batch jobs, sometimes taking more than
twenty-four hours. This was actually spectacular performance, since we used
custom hardware to do most of the computation. I wrote the code that
interfaced with the hardware. When the code timed out trying to talk to the
hardware, the only sensible thing to do was report the error and abort the
program.

Unfortunately, this seemed to happen quite often. Jobs would abort randomly,
after about eight hours, sometimes much less and sometimes not at all.
Overheating was the obvious first suspect, so that was investigated and ruled
out. The hardware was running cool and was in perfect working order. The
customer started splitting jobs into shorter batches and combining the results
by hand. We wrote code to help them automate this workaround. But batches were
still randomly aborting. And I couldn't replicate the bug, despite having
identical hardware to the customer. Something in the customer's environment
was essential.

Eventually, somebody at the customer figured out the problem. My boss called
me up and said, "The customer suspects your code is not time-travel
compliant." It was true. My code assumed that time always goes forwards. If
time ever went backwards while my code was waiting for hardware, it would
immediately time out and abort the batch. And our customer encountered a bug
where time _did_ appear to go backwards occasionally. I was too stressed out
over other tasks to ask for details of the bug. I just sent them a fix and
breathed a sigh of relief when they accepted it.

A Google search now reveals that there was an issue with time going backwards
under Xen on dual-core Opterons, which is what the target platform was. They
never told us they were using Xen. Maybe that's why they were much nicer to us
after the problem was diagnosed!
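
For illustration only, since the actual fix isn't described: a minimal Python sketch of a timeout loop that tolerates a clock stepping backwards. `wait_for`, `ready` and the injectable `clock` and `sleep` parameters are hypothetical names. It defaults to `time.monotonic` and, as a second line of defence, only ever adds non-negative deltas to the elapsed time.

```python
import time


def wait_for(ready, timeout_s, clock=time.monotonic, sleep=time.sleep, poll_s=0.01):
    """Poll ready() until it is true or timeout_s of forward time has passed."""
    elapsed = 0.0
    last = clock()
    while not ready():
        now = clock()
        # A backward step counts as zero elapsed time, not as a timeout.
        elapsed += max(0.0, now - last)
        last = now
        if elapsed >= timeout_s:
            return False
        sleep(poll_s)
    return True
```

Passing the clock in as a parameter also makes the "time went backwards" case testable with a fake clock, which would have been one way to reproduce the customer's failure without their environment.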

------
jacquesm
Ok, here's one of mine, it's only fair.

Jasper L. systems administrator of an early web hosting company calls up one
evening, there is a problem with the paging system.

A certain host is being paged as 'offline' but when checking the machine works
fine.

So I go there, and bit by bit we check out the software. Everything works fine, but
sure enough every 10 minutes or so the machine (called 'chopper', I'll
remember that for the rest of my life) gets reported as 'down' again.

But there is absolutely nothing wrong with it.

After ruling out all the software bits we figure it must be hardware somehow.
The way the supervisor works is it sends a ping packet to the machine, and if
the machine responds it is deemed to be up. But chopper misses one ping out of
every 5 or so, and sometimes several in a row.

We swap out the computer, move the scsi drive to another box and boot it.
Chopper registers as off-line, but works just fine.

More confusion, finally, out of desperation we start messing with the network.
This is all 10Base T, coaxial cables with little T connectors on the machines
and a terminator at the end of the line on the last 'T' to make sure the
impedance is right.

The terminator was on, so that wasn't it.

We went for a break at that point, we'd been at it for hours. Finally, by
elimination we figure the _only_ things we haven't changed yet are the drive
and the 'T' connector, but surely that can't be it.

And it was... somehow that T connector was a pretty good filter for some bits
in the incoming or outgoing ping packet changing one of the bits, causing the
IP checksum to fail. No returned ping... we replaced it back and forth 3 more
times just to make sure we weren't seeing things.

~~~
Poiesis
That's not 10BaseT, it's 10Base2 (or thinnet):
<http://en.wikipedia.org/wiki/10BASE2>

And--even though I've strung cables for both kinds, and of course 10Base2
takes far less cable, the maintenance issues and troubleshooting headaches of
the bus topology make me _so_ glad we've progressed to more robust network
topologies.

~~~
jacquesm
Ah, yes, of course, you're right. If I could vote I would vote you up but my
voting seems to be on the blink.

~~~
tyrmored
Define "irony" :)

------
ja27
I worked on the communication middleware for an early Windows tablet. My part
sat between a VB6 GUI app (that we customized for each customer) and a RS-232
device and handled all the communications.

One customer had intermittent communication failures that we couldn't identify
for weeks. I finally went on-site and hooked up a serial port logger between
the tablet and device. After an hour of testing, I finally captured what
caused the communication failure. It was some debug message from the GUI app,
somehow coming across the serial port. I called back to the office and had a
guy from our group go over to the VB app team and ask which genius was
logging stuff to the serial port. Sure enough, some guy had been using VB's
remote debugging feature with a null modem cable and left it turned on in the
shipped app.

------
dazzawazza
Every single time there is a memory corruption on a games console it takes
hours and hours of tedious detective work to find.... shudder. Glad I don't do
that any more.

~~~
samlittlewood
Yeah I had 'fun' with an embedded bootloader that would corrupt a single byte
in the image (at offset 2^20 iirc)

------
rbritton
We set up a new network segment with multiple VLANs trunked over a fiber line.
The entire thing worked flawlessly except for one VLAN. It turned out to be a
conflict between a very strict media converter and bad ARP packet-generating
code in the device acting as the DHCP server on that VLAN. The ARP packets it
generated were too short for the media converter, so it dropped them thinking
they were damaged and the computers on the VLAN were unable to find the
gateway. The end solution was to go with a dumber media converter that didn't
work at the packet level. The DHCP device to this day still has the bad code.

------
philh
The bug itself wasn't very interesting, just a brainfart. But it wasn't doing
what it should, so I added debugging output - one line per pixel. (I was
generating a png from a custom image format.)

Because there was so much output, I did a `| head` to keep it manageable. Saw
what I was doing wrong, fixed it, reran the command, it seemed fine - but the
output image looked exactly the same as it had before.

It took me about an hour to realise that once head had exited, the pipeline
sent a signal to my program, killing it. The image wasn't getting written
until after all the pixels had been processed, so it never got to that stage.
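
A minimal Python sketch of one way around this, assuming nothing about the original program: `convert` and its fake pixel transform are hypothetical. CPython ignores SIGPIPE at startup, so writing to a pipe whose reader has gone raises `BrokenPipeError` instead of killing the process. Treating debug output as best effort lets the real work finish.

```python
import os


def convert(pixels, debug_stream):
    """Process every pixel; debug output stops quietly if the reader is gone."""
    out = []
    debug_ok = True
    for i, p in enumerate(pixels):
        if debug_ok:
            try:
                debug_stream.write(f"pixel {i}: {p}\n")
                debug_stream.flush()
            except BrokenPipeError:
                debug_ok = False      # reader exited; keep working
        out.append(255 - p)           # stand-in for the real conversion
    return out


# Simulate `| head` having exited: a pipe whose read end is already closed.
read_fd, write_fd = os.pipe()
os.close(read_fd)
debug = os.fdopen(write_fd, "w")
result = convert([0, 10, 255], debug)
try:
    debug.close()                     # flushing leftover debug text fails again
except BrokenPipeError:
    pass
```

Sending the debug lines to stderr and leaving stdout alone would sidestep the problem as well.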

------
itgoon
I once worked an issue with a crashing Exchange server.

It had to do with a connection being reset in the space of time between the
server checking if the connection was good, and actually using it.

The client could reproduce it at will, and after a little bit of code at home,
I could, too. Back at the office, no repro.

It took a lot of back and forth (to put it mildly), but the problem at work
was that the network was too slow - the server got the reset with plenty of
time to "notice" it, and the crash didn't occur.

Once we set up the repro on its own switch at work, everything failed in the
expected manner.

------
uggedal
Peter Seibel asks this question in all his interviews in Coders at Work:
<http://www.codersatwork.com> It seemed like most were concurrency related.

------
sdave
most embarrassing:

" #define INTERVAL 10 * 86400 "

read the above 'C' statement. Yeah, there are no brackets surrounding the
'10*86400', and INTERVAL was being used in some calculations. I wasted a lot of
time debugging this crap.
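
Why the missing brackets matter: the preprocessor pastes the macro body in as text, and operator precedence does the rest. Python has no preprocessor, so this sketch imitates the substitution with `str.replace`; the `expand` helper and the division expression are hypothetical, since the post doesn't say which calculations were involved.

```python
def expand(expr, macros):
    """Crude stand-in for the C preprocessor: plain text replacement."""
    for name, body in macros.items():
        expr = expr.replace(name, body)
    return expr


seconds = 1_728_000                   # twenty days, in seconds
bare = expand("seconds // INTERVAL", {"INTERVAL": "10 * 86400"})
safe = expand("seconds // INTERVAL", {"INTERVAL": "(10 * 86400)"})

bare_value = eval(bare, {"seconds": seconds})   # (seconds // 10) * 86400
safe_value = eval(safe, {"seconds": seconds})   # seconds // (10 * 86400)
print(bare, "=", bare_value)
print(safe, "=", safe_value)
```

The bare version divides by 10 and then multiplies by 86400, giving 14929920000 instead of 2.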

------
kabdib
Embedded OS for a consumer product; the units were freezing (very
occasionally) in the field. Usually the units were in store kiosks (=
disappointed and unimpressed could-have-been-customers).

Turned out to be a race condition in an interrupt handler, where the OS would
say "wake me up when something interesting happens" but something else would
sneak in, clobber the wake-up trigger, which meant "... never."

Two weeks to find it, fixed by swapping two instructions in some assembly
glue.

The harder a bug is to find, the simpler the fix is.
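
This class of bug is often called a lost wake-up. A user-space analogue in Python, not the interrupt-handler fix itself: the hypothetical `WakeFlag` below records the event in a flag and does the check and the wait under one lock, so a wake-up that arrives early is remembered instead of clobbered.

```python
import threading


class WakeFlag:
    """Sleep/wake pair that remembers an event that arrives before the sleep."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = False

    def wake(self):
        with self._cond:
            self._pending = True      # recorded even if nobody is sleeping yet
            self._cond.notify()

    def sleep_until_woken(self, timeout=None):
        with self._cond:
            # Check and wait under the same lock: nothing can sneak in between.
            woke = self._cond.wait_for(lambda: self._pending, timeout)
            self._pending = False
            return woke


flag = WakeFlag()
flag.wake()                                        # event arrives first
woke_early = flag.sleep_until_woken(timeout=1.0)   # still seen
woke_again = flag.sleep_until_woken(timeout=0.05)  # nothing pending now
```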

~~~
Poiesis
"Simpler" depends if you count the troubleshooting time. To rip off an old
anecdote:

    
    
      Change one line of code: $3.00
      Find which line of code to change: $3,436.88

------
bad_user
I once did a crawler that searched the content of a couple of large websites
for phone numbers.

I did this daemon in Perl, using fork() to search multiple websites in
parallel. When a new search was initiated, the daemon broke the search in
multiple packages (of 5000 items to be searched). After initiating 5000
searches (keeping the maximum number of active processes at 30), I did a wait
for all the children to finish, so that I can mark the package as "done".

The trouble with fork is that opened sockets are sometimes not forked very
well. And I forked my database connection (a DBD::Pg). Forking a DBD::Pg
connection went well in my tests, but once in a few thousand forked processes,
a process would stall. Sometimes it recovered after a few hours, sometimes it
didn't.

I tried setting an ALARM that would make the process auto-kill itself after
some time. Didn't work. The final workaround was to monitor the child
processes from the parent, and kill -9 the ones that hung. But the whole
processing became too slow.

Finally I gave up on the idea of packages and on marking the search as "done".
So I just marked the search as "done" after the last child was forked.

It was a difficult bug because the client had no idea it was a bug, and waited
patiently for the results (sometimes as much as 2 days). It was also difficult
because at first I had no idea why my processes hung.

------
stcredzero
Worst bug monetarily:

I was consulting for a multinational on a commodities trading app. I broke the
stuff sending trade data to the Risk Management system that does automatic
hedge trades. My manager comes down the hall and tells me I've caused $4
million in exposure in the past two hours. At the end, her comment is: "A
million here, a million there, business as usual!"

I told that story to a former coworker, and he pointed out, "Well, your boss
was calm about it. I bet those traders were _livid_!"

~~~
newsdog
so it's your fault...

~~~
stcredzero
Yes, it was my fault. And we fixed it and rolled that out to production in
minutes. And those traders probably got their hedge trades done manually. But
at the moment, it seemed pretty bad.

------
cmon_scum
A neat piece of software calculating the prices for customers had a
hierarchical pricing model for different contracts with large corporations.
But the hierarchies were not separated. You started in one hierarchy and ended
up in another one. While trying to read the tree from the bottom up you could
end up reading the whole database for calculating a price with conditions from
every customer.

I must not state the accuracy of that price...

------
MaysonL
Probably the worst bug I ever encountered was a documentation bug.

Back in 1970, I was at my first programming job, working for a very small
consulting company. The contract was to convert a large system from 1401
Autocoder to System/360 Cobol. However, instead of the source code for the
current system, we were given system specs, which were almost flawless.

However, in final acceptance testing, one program, for one particular
accounting line, kept producing results that differed from the current system.
The customer would not accept this, even though my code implemented the
specified calculation precisely.

Eventually, after a few rounds of back and forth, I demanded to see the source
code of the current system, and after much more back and forth, the source
listing was produced. Looking at the source, it seemed to be doing exactly
what my code was doing.

There was one fact not taken into account - the present system, rather than
running on the 1410 for which it was written, was running in emulation mode on
their new 360. And they hadn't bought a license for the compiler. So the small
accounting change they had made a few months prior had been made to the object
code - the card deck itself. Luckily the programmer who had made the change
had written a note to that effect on the listing - in the object code part,
where he eventually sheepishly showed it to me. We then got paid.

(I also once had a bug that almost got national distribution via Time
magazine. Luckily they had only printed a few thousand copies before someone
noticed it. That one would have gotten me fired, but for the fact that the
same release also included a halftone compression routine that enabled them to
push back their photo deadline by a day. The reason the bug escaped my testing
was that it only occurred in pictures where the line count was of the form
4n+3 - I had tested with both even and odd line counts, but never managed to
use one that produced the off by one error.)

------
gills
We were doing some early testing on this distributed system, and process A
kept backing up well under its nominal load. A had no allowance for shedding
excess load (it was broadcasting high frequency safety-critical data), and the
network buffer backed up under the shitty messaging layer. It turned out that
process B[0..n] couldn't pull the messages off the wire quickly enough because
process C was blasting some other data to B at about 1000 times the nominal
loading, filling up B's VM and kicking off the (improperly-tuned) garbage
collector for 2-second intervals -- it ate the processor time needed to handle
the load. Total death spiral.

Needless to say we ended up with more robust load management code and tuned
the output of some processes.

------
SingAlong
One bug in Daylife's API tester's javascript. It just never worked in IE for
some stupid reason. The solution was surprisingly simple.

After some analysis with firebug, I just figured out that the variable's value
wasn't being preserved after a particular point. So I just had to take its
value and assign it back to itself.

It's in this file: <http://demos.daylife.com/api_tester/js/daylife.tester.js>
(lines prefixed with comments "IE Fix")

[Full disclosure: I was not and I'm not a part of Daylife. I only participated
in a contest they once conducted and was asked to try solving their problem by
one of their staff.]

~~~
sandaru1
Is this relevant to following by any chance?

<http://karma.nucleuscms.org/item/101>

~~~
SingAlong
exactly. thanks for the blog post link.

------
tlrobinson
This isn't a particularly bad one, but it's the first one that popped into my
head, and is somewhat unexpected unless you're very familiar with how
JavaScript RegExp's "exec" method works:

    
    
        > r = /^x$/g
        /^x$/g
        > r.exec("x")
        x
        > r.exec("x")
        null
        > r.exec("x")
        x
        > r.exec("x")
        null
    

(When exec is used on a regex with the global flag it will remember the
position of the end of the last match and perform the next match beginning at
that offset. Obviously this will cause very bizarre behavior if you expect it to be
idempotent...)

------
vanekl
Worst: c/c++ pointer memory errors, duh. Especially when there are thousands
of pointers and don't know which one is overwriting memory that it shouldn't
be overwriting.

Second worst: linking c programs when some of the symbols are duplicated in
more than one library yet are defined differently. Nobody mentioning this to
the developer made it all the more exciting.

~~~
nwatson
Thanks to the Boost.org developers the C++ pointer issues are mostly a thing
of the past. Using shared_ptr et al. for all but the most time-critical code
has made memory-smashing, memory leaks, etc. so easy to avoid!

------
tlb
Calling the least mean square fit function in the linpack library used to go
into an infinite loop occasionally. I tracked it down to some very dusty
FORTRAN code in the linpack kernel, which gfortran42 compiles incorrectly.
Adding -ffloat-store to the compile flags for that library fixed it.

------
jokull
ExternalInterface hell (Flash-JS bridge)

~~~
tlrobinson
Like how the moment you move a Flash object ExternalInterface breaks
completely in IE?

------
brazzy
Define "worst". The most costly? The hardest to diagnose? The one with the
most stupid cause?

~~~
jacquesm
Take your pick. Whatever works for you. For me the definition would probably
be the hardest to diagnose, but I'm sure there are better ones that lead to
more interesting stories and lessons to be learned.

Bugs are a great learning experience.

------
earl
The most expensive bug that I can talk about publicly that I've encountered:

I used to work on software for a very expensive (started at 70k) DNA/protein
analysis hardware solution. This was in the late 90s, and our GUI was a couple
million lines of MFC code. I was responsible for an analysis package written
in cross platform (linux and windows) C++, but the main UI and all the complex
interface was MFC.

My then boss was a unix idiot who hated Windows. Which is fine and all, but a
Windows product was paying our salaries. So one day he decided to rewrite my
(well tested, though not unit tested) file handling code written using the
win32 API and port it to posix. Not to accomplish anything different, mind
you, and not to add it to the cross platform bits since it was useless in
isolation and, oh yeah, we had multiple millions of lines of MFC so nothing
was ever getting ported.

In any case, during this "upgrade", he found a function called DirTouch, which
was intended to make sure certain directories existed during the normal course
of saving data. Well, this gentleman subtly changed the semantics from "create
the directory if it doesn't exist" to "create the directory if it doesn't
exist, but if it does already, then silently delete whatever is there". This
wasn't the whole bug, but this change of semantics to a destructive function
was the root cause.

This change got shipped. One of our customers killed literally more than
$1.3MM worth of data, since each machine run might cost $100K given lab time,
reagents, prep, etc.
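
For comparison, the original "create the directory if it doesn't exist" semantics are a one-liner in Python. This is a sketch rather than the product's code: `dir_touch`, `run_001` and `reads.dat` are hypothetical names. The second call shows that an existing directory and its contents are left alone.

```python
import os
import tempfile


def dir_touch(path):
    """Create the directory if it is missing; never modify existing contents."""
    os.makedirs(path, exist_ok=True)


with tempfile.TemporaryDirectory() as root:
    run_dir = os.path.join(root, "run_001")
    dir_touch(run_dir)
    data_file = os.path.join(run_dir, "reads.dat")
    with open(data_file, "w") as f:
        f.write("expensive data")
    dir_touch(run_dir)                # second call must be a no-op
    survived = os.path.exists(data_file)
```

A regression test along these lines, run against the ported function, would have flagged the change in semantics before it shipped.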

~~~
vsync
What happened to him? And did he try to blame you? In his mind I'm sure all he
did was port your code, which must after all have been buggy.

~~~
earl
Well... it was a shitshow. We were just told by a livid CEO that a major
university (the bio world is surprisingly small and it's pretty easy to get a
terrible rep) had a major data loss, etc. We were on the phone with them while
simultaneously overnighting the computer to data recovery trying to figure out
what happened. It took two frantic days before I figured out what the problem
was, patched it, and started the emergency release process for all the other
customers.

But yeah, I initially took responsibility for the mess, since it was most
likely my code that did it. After figuring out precisely what happened, and
having to demonstrate to my boss that the old code did not delete stuff out of
directories, that responsibility got walked back a bit. Still... I got "laid
off" within 5 months of that.

Basically, there was blame to go around: the university should have had a
comprehensive backup solution in place, particularly given the expense
involved in creating the data; I should have been more careful to not
needlessly call this directory touch operation while creating cached bits in
the data files (the "data file" that the user thought of was actually a
directory because we were running into fat16 and fat32 file size limits), but
at the end of the day, my boss took code that used to work and turned it into
code that not only didn't work but was broken in the worst way possible for no
other reason than to fuck about with posix idiocy. In a giant win32/mfc app.

------
clistctrl
I've had a lot of frustrating experiences, but the one that made me slap my
face the hardest due to the sheer simplicity of it was the time I was working
on a windows service. I was opening a socket and listening, but for whatever
reason I could not get the client to connect! After a half hour or so of
playing around I finally minimized the million windows I had open to see
Norton sitting there asking me to unblock the application _face palm_

------
romanm
A multi-threading system and a queue: it would be hard to explain the whole
algorithm here, but sometimes you think that one thread can't affect others and
it is not true. I had to find an exception that killed not just the thread but
also stopped the queue from being consumed. I remember I went home at
something like 5am.

