

Ask YC: Your most interesting bugs / bug fixes? - technoguyrob

I ran across this Reddit story:

http://www.reddit.com/info/6n7k3/comments/

And enjoyed both the article and the commentary on people's own bug troubles. Do you guys have any of your own stories (preferably as detailed as possible) about bugs and bug fixes?

This has to be one of my favorites:

http://www.ibiblio.org/harris/500milemail.html
======
pmjordan
At my game development job we got some incredibly hard to reproduce crashes
every now and then on the game that was supposed to be shipping soon. We were
using Lua for scripting, and the stack traces showed that it was happening
somewhere in Lua. This was on the Wii, so dropping to a debugger is only
possible if you happen to be running the right build on the right hardware
attached to a PC running the right software. Which meant not in the QA
department.

Not that a debugger helped much, even after QA managed to get a fairly
reliable but convoluted repro. It turned out that really stressing the
scripting system with certain patterns would cause the crash to happen much
more frequently, so I could see it in the debugger. That didn't help all that
much: it was an apparently random memory stomp, and by the time it crashed, it
was much too late to tell where it came from. I forget how I figured this out,
but I eventually managed to narrow the cause down to garbage collection runs.
Now, GC runs were periodic, but consoles place hard limits on memory. You
can't just swap to disk when the going gets tough, so we had memory budgets
for each game component, including the scripting system. So if scripts got
particularly greedy, they'd run out of memory before the next GC run.

Now, as the memory limits were hard, some clever sod had put a GC call in the
malloc hook that supplied Lua with memory, set to run whenever no memory was
available (and the game would otherwise have crashed) - no doubt in order to
fix an earlier bug. Most of our scripts didn't create hash tables, arrays, and
strings frequently, so this bug hadn't been a big enough problem for what must
have been years. In Lua, those types of objects require _two_ allocations, one
for the base object and one for the data storage. You can see where this is
going.

If Lua ran out of memory halfway through creating a hash table, array, or
string, that is, after successfully creating the base object, but failing on
the data store, it would trigger a GC run. Thankfully this was actually not
that hard to hit, as the data store memory generally was way bigger than the
16 or so bytes used for primitive types (i.e. base objects, numbers, ...) so
the probability of not having enough contiguous space was much higher than not
having a 16-byte slot. In any case, the hash table (etc) constructor had of
course not returned yet, and therefore there were no references to the hash
table object yet, and it promptly got collected. The freed memory was still
initialised as a hash table and returned from the constructor, and it was just
a matter of time until another allocation wrote straight over that. Not just any
allocation of course, as re-allocating it as a (legal) primitive type wouldn't
have caused a crash.

The fix was of course easy once the cause was known: don't put the base object
in the allocated list for GC consideration until the whole object has been
assembled.
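
A toy Python sketch of the failure mode and the fix described above. This is not Lua's real allocator or collector; `ToyHeap`, `Obj`, `new_table`, the 100-byte budget, and the sizes are all hypothetical, chosen only so that the second allocation runs out of memory.

```python
class Obj:
    def __init__(self, size):
        self.size = size
        self.freed = False

class ToyHeap:
    """Fixed-budget heap with a naive collector (illustrative only)."""

    def __init__(self, budget):
        self.budget = budget
        self.used = 0
        self.tracked = []  # objects the collector is allowed to free
        self.roots = []    # objects that scripts still reference

    def collect(self):
        # Free every tracked object that nothing references.
        for obj in list(self.tracked):
            if obj not in self.roots:
                self.tracked.remove(obj)
                self.used -= obj.size
                obj.freed = True

    def alloc(self, size):
        # Stand-in for the malloc hook: collect when the budget is exhausted.
        if self.used + size > self.budget:
            self.collect()
        if self.used + size > self.budget:
            raise MemoryError("script budget exhausted")
        self.used += size

def make_garbage(heap, size):
    heap.alloc(size)
    heap.tracked.append(Obj(size))  # tracked but never rooted

def new_table(heap, data_size, track_early):
    heap.alloc(16)                   # first allocation: the base object
    table = Obj(16)
    if track_early:
        heap.tracked.append(table)   # buggy order: visible to the GC too soon
    heap.alloc(data_size)            # second allocation may run collect()
    table.size += data_size
    if not track_early:
        heap.tracked.append(table)   # fixed order: track once fully assembled
    heap.roots.append(table)         # the caller now holds a reference
    return table

def demo(track_early):
    heap = ToyHeap(budget=100)
    make_garbage(heap, 80)
    return new_table(heap, 64, track_early).freed

print(demo(track_early=True), demo(track_early=False))  # True False
```

With early tracking, the constructor hands back an object that the collector has already freed; with late tracking, the same allocation pattern is harmless.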

Took me _days_. And I wasn't even the first person to be assigned the bug; it
was one of those hot potatoes that went round all the senior people until it
landed on the junior tech programmer's list (mine).

I looked in the checkin history for the malloc hook, and they had shipped at
least one game with that bug in (records didn't go back far enough to rule
out the game before). If you figure out what scripts to trigger repeatedly, you
can make that game crash.

I can't really blame just one person for this. Putting the GC call in the
malloc was thoughtless. Maybe I would have done the same without checking that
it was safe. In Lua itself, that was a pretty careless way to handle object
creation, given that the malloc hook is user-defined and Lua has no control
over what goes on in there.

More bedtime war stories another time.

------
thaumaturgy
I'm ashamed to admit that I've still got a hefty little piece of VBScript at a
grocery store that does some price & cost number crunching for them; it
features a very funky two-stage inner loop which will occasionally produce
random numbers if you feed in very large data sets. Two different pretty sharp
people have spent a lot of time trying to track it down, and nobody can find it.
One of them looked at it and about cried when he realized how it was supposed
to work.

Not too long ago, I had to figure out a timing issue in some JavaScript for a
photo gallery. It's about as multi-tasking as JavaScript can get, and the code
has its own task manager. Each new task was supplied with a UID, which was
just the number of milliseconds since 01/01/1970. Occasionally, however, a
task wouldn't fire. It would get queued, and then disappear from the queue,
but never run. Every time I set breakpoints in the script, used alerts, or
otherwise examined the code as it ran, everything worked fine.

It took me most of an afternoon to realize that occasionally two different
tasks were being created within the same millisecond, and getting the same
UID. Subsequently, one of the tasks would stomp on the other one. I kicked
myself for knowing better than that and wrote a more reasonable UID hash.
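
A minimal Python sketch of the collision and one possible fix. The original code was JavaScript and is not shown in the post, so `timestamp_uid`, `make_uid_factory`, the dict-based queue, and the task names are all hypothetical; a timestamp plus a counter is just one way to get a "more reasonable" UID.

```python
import itertools

def timestamp_uid(now_ms):
    # Buggy scheme: the UID is only the millisecond timestamp.
    return now_ms

def make_uid_factory():
    # One possible fix: pair the timestamp with a per-process counter,
    # so two tasks created in the same millisecond still differ.
    counter = itertools.count()

    def next_uid(now_ms):
        return f"{now_ms}-{next(counter)}"

    return next_uid

def enqueue_all(uid_fn, tasks, now_ms):
    # A queue keyed by UID: a duplicate key silently replaces the earlier task.
    queue = {}
    for task in tasks:
        queue[uid_fn(now_ms)] = task
    return queue

same_ms = 1_210_000_000_000  # both tasks are created in this millisecond
tasks = ["load thumbnail", "fade image"]
buggy = enqueue_all(timestamp_uid, tasks, same_ms)
fixed = enqueue_all(make_uid_factory(), tasks, same_ms)
print(len(buggy), len(fixed))  # 1 2
```

Stepping through in a debugger spreads task creation over many milliseconds, which would explain why the problem vanished under observation.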

Years ago, one of the first languages I got to play with was Pascal, on a
pre-7.5 Mac OS. I was using THINK Pascal at the time, and having fun, but
occasionally my program would bomb out in really interesting ways. It took me
a long while to trace the bug to a problem in the compiler, where a struct
that was supposed to be 4 bytes long was getting loaded on to the stack in 6
bytes. Usually this wouldn't be an issue, except that one of my functions was
popping the struct off the stack as a longint, instead of as the actual
struct. Whoops.

But, probably the most fun "bug" I ever figured out wasn't really a software
issue at all. I worked in the data processing department of a Bay Area school
district. Despite being one of the better school districts in the area, and
having enough spare change for student A/V programs and the like, a lot of
their systems ran on an ancient Unisys mainframe, COBOL74, and reel tapes.
(Hey, it worked. Unisys mainframes have incredible recovery abilities.)

The reel tapes had a little reflective strip near the end of the tape, and the
tape drive had this optical sensor that would pick up the reflection as the
strip flew by, and dismount the tape.

It was the damnedest thing -- sometimes, usually in late Spring or early
Summer, a tape would be in the middle of some operation, and it would
dismount, even though it was nowhere near the end of the tape.

I was sitting in the room one day when it happened, and turned around in time
to see a perfect little sunbeam sneaking through the blinds and shining right
on the drive's optical sensor, fooling it.

------
tx
The visual plotting component I was working on would suddenly stop producing
output: just a black screen but everything else was normal - no leaks, race
conditions, crashes, etc - just no video. The bug was so rare that only
customers ever saw it, but they were pissed (serious manufacturing folks, they
ran our software 24/7 on production floors)

Eventually I wrote some debugging code around it to take screenshots of the
main window and compare them with a reference image about 20 times a second
(after each blit) and left it running for months on an old PC.

One night I got an SMS sent by my debugging script with attached call stack,
etc, but I knew what had happened without even reading it: I got it 1 second
after daylight saving time kicked in - the plotter didn't handle it properly
and was trying to plot data from the future, which, of course hadn't happened
yet.

------
CRASCH
Some of the more interesting bugs are the ones not in your code.

I decided to list the ones outside my code that I've found in the last year or
so. These are windows / .net bugs.

1. When you add a program to run at startup to the registry and that program
has a space in its name, like "CoolCompany CoolSoftware.exe", it will also
launch the program in the same directory called "CoolCompany
ReallyCoolSoftware.exe".

2. The FILETIME data type in .net is mis-defined as signed. This leads to
conversion problems that were really hard to track down. File times would
magically change in certain conditions by a few seconds to a few minutes. This
was still in the 3.0 version which is the last I checked.

3. .net Socket library pins the buffer in memory to prevent garbage
collection and allow the call to not lose the buffer. Unfortunately it never
unpins the memory. This looks like a memory leak and fragments the heap badly.

4. The split control width is reported differently on XP and Vista. Hiding
the split on one platform will give a width of zero. On the other one it
reports the width.

5. The zip lib on .net CLR ends up with much worse compression than a
properly implemented zip lib. Ironically the J# zip lib works well.

6. The windows setup in visual studio release build fails to install required
components like .net. The debug build will?

7. The .net file and directory objects don't support long paths. A user can
easily create a path longer than the limit by dragging another user's
directory to their desktop. These files will (un)conveniently be unreachable
by any .net app not making direct win32 calls.

8. I don't even want to think about the threading model in .net with
delegates (they don't always work) etc. Bonus: they work differently on Vista,
yay! Give me a semaphore, a mutex, etc.

9. Vista UAC is such an embarrassment. I'm completely astonished it got
released. In my opinion everyone who even saw it should be fired. The design
is worse than privilege elevation designed over a quarter century earlier. The
implementation is so bad any still sane person that uses their computer for
more than light work will turn it off.

Windows aside, if anyone is developing a commercial app on Windows, I strongly
recommend writing directly to the win32 API. If you try to implement anything
non-trivial in .net (like I did) you may end up with a huge rewrite. In my
opinion the convenience of .net is just not worth the loss of control.

I'm mid-port to a native code version of my application. I've lost 95% of
the re-distributable size and increased performance by 50%.

Seriously though. I think there is a general problem with programming.

Part of the problem is the workaround. There is a certain amount of pride /
respect someone gets for figuring out a workaround to a tough problem. The
issue with this is that it discourages the fixing of the root problem.

The other part of the problem is consistency. "I don't care what the interface
is as long as it is concise and consistent." I often feel like a turn of the
century mechanic trying to build a car from a box of randomly sourced parts.
At the turn of the century there were not any standards for important things
like bolts. Everything varied from thread width to thread angle to thread
spacing. The situation with non standard mechanical parts is pretty close to
what we have today with software. Sometimes I think all I do is write code to
convert data from one type to another. All to appease the requirements of the
various APIs I need to use.

Someone will probably note that I'm on windows and X is so much better to code
to. I will preemptively and completely agree. I typically hack on my mac and my
grid runs on linux boxes. I need a windows component for market reasons.

Bonus!!! The Unknown or expired link when posting to Hacker News after typing
for a long time.

~~~
xirium
> Bonus!!! The Unknown or expired link when posting to Hacker News after
> typing for a long time.

Definitely. I hit it after reading your comment. If you ever want a technique
to discourage long and insightful comments, this is it.

------
qwph
One bug I had to fix recently was in the commentary system of a PS2 soccer
game (with an ancient codebase, parts of which hadn't been touched in 10
years).

The commentator would always refer to a particular player on a particular team
by the name of a different player on that team. No-one had any idea why.

This issue had been knocking around for years and somehow ended up on my
plate. After about 3 or 4 days of digging around and finding nothing, I
finally got to the bottom of it. It just so happened that subtracting the IDs
of the two players in question gave exactly 65536. It didn't take me much
longer to figure out that the game's database used 32 bit integers for player
IDs, and the commentary system was using 16 bit integers, so from the
commentary's point of view, the two players had the same ID.

Luckily for me, I was able to fix it by swapping in a slightly-less-ancient
commentary system which used 32 bit IDs instead.
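
A short Python sketch of why a difference of exactly 65536 makes two 32 bit IDs indistinguishable once they are squeezed into 16 bits. The ID values and the helper `to_uint16` are made up for illustration; the post only gives the difference between the two IDs.

```python
def to_uint16(player_id):
    # Keep only the low 16 bits, as an unsigned 16 bit field would.
    return player_id & 0xFFFF

# Hypothetical IDs that differ by exactly 65536 (2**16).
player_a_id = 70_001
player_b_id = player_a_id + 65_536

print(player_a_id == player_b_id)                        # False
print(to_uint16(player_a_id) == to_uint16(player_b_id))  # True
```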

------
wataguy
My company made a machine for medical labs, and I wrote an app that monitored
the performance of those machines. Our basic metric was samples analyzed per
hour, and my app would grab the runlogs from all the machines (we needed the
data in the logs for other purposes, or I'd have done the analysis on the
machines and just transferred the results), count how many samples were
analyzed each calendar day, and then divide by 24. Easy, simple, and worked a
treat. That is, until one fine spring Monday.

The service manager walked into my office bearing the graph of machine
performance, and said there was something wrong with my script. He showed me
the graph, and pointed out that most of the machines had exhibited a nearly 5%
drop in productivity on Sunday. This wasn't usual; the machines either worked,
or they didn't, and they were expensive enough that labs generally ran them
continuously. We looked at the graph for a while, and realized that the
affected machines were all in the US; the ones in Asia looked normal.

This was baffling, but I did a little checking of the data, and it looked like
the script was counting samples correctly. The nearly 5% drop was real.

But the next day everything was back to normal and people stopped bugging me
about this, so I just let it go, though I couldn't get rid of a nagging
feeling that I was missing something.

I didn't realize what it was until six months later, when those selfsame
machines exhibited a nearly 5% rise in productivity, again reported to me by
the service manager on a Monday. I looked at him for a moment, then said, "Not
all days are 24 hours long!" I'd forgotten about daylight saving time (not
observed in Asia), which makes one day in the spring only 23 hours long, and
one day in the fall 25 hours long.

I fixed the calculation by using time()s, so the script'll work for any time
zone. As an added bonus, even the occasional leap second is handled correctly.
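
A minimal Python sketch of the arithmetic, assuming the fix amounts to dividing by the real elapsed time between two epoch timestamps rather than by a fixed 24. The function names, the timestamps, and the steady rate of 100 samples per hour are hypothetical.

```python
def samples_per_hour_naive(sample_count):
    # Original approach: assume every calendar day lasts 24 hours.
    return sample_count / 24

def samples_per_hour(sample_count, day_start_ts, day_end_ts):
    # Divide by the real elapsed time between two Unix timestamps
    # (seconds since the epoch), which is unaffected by clock changes.
    elapsed_hours = (day_end_ts - day_start_ts) / 3600
    return sample_count / elapsed_hours

# Hypothetical machine that analyzes a steady 100 samples per hour,
# measured over a spring-forward day that lasts only 23 hours.
day_start_ts = 1_000_000_000           # arbitrary epoch second for local midnight
day_end_ts = day_start_ts + 23 * 3600  # the next local midnight, 23 hours later
count = 100 * 23

naive = samples_per_hour_naive(count)
actual = samples_per_hour(count, day_start_ts, day_end_ts)
print(round(naive, 2), round(actual, 2))  # 95.83 100.0
```

The naive figure is about 4.2% low on the 23-hour day, which lines up with the "nearly 5%" drop; on the 25-hour day in the fall it comes out about 4.2% high.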

------
jfarmer
When I was working on Adonomics there was a bug that caused certain virus
scanners to claim that we were trying to install spyware on the visitor's
computer.

It just so happened that there was a string on the page that matched the
scanners' virus signature. It was in the HTML so between a few of the
offending lines I inserted <!-- THIS IS NOT A VIRUS -->. Problem solved.

I'd never, ever seen a website that did that, let alone one that I created!

------
bayareaguy
I once had to fix a random memory corruption in some X windows code which used
a proprietary third party library to which we didn't have the source. After
linking with my own malloc/free library that would carefully add lots of guard
space around every allocation, I discovered that the third party library
tended to overwrite an area just past a structure it was responsible for
allocating. We reported the bug to the vendor, but we had a demo coming up, so
in the meantime we just wrapped calls to their stuff with some code that
swapped in my allocator. Simply allocating 10% more than what the library
asked for made the problem go away.
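
A rough Python sketch of the guard-space idea, using a `bytearray` in place of real heap memory. The actual allocator was a C malloc/free replacement that the post does not show; `GUARD`, `guarded_alloc`, `check_guards`, and the 16-byte structure are hypothetical.

```python
GUARD = b"\xAB" * 8  # hypothetical guard pattern and width

def guarded_alloc(size):
    # Surround the caller's region with known guard bytes on both sides.
    return bytearray(GUARD + bytes(size) + GUARD)

def check_guards(block):
    # Report which guard regions no longer hold the expected pattern.
    damaged = []
    if block[:len(GUARD)] != GUARD:
        damaged.append("before")
    if block[-len(GUARD):] != GUARD:
        damaged.append("after")
    return damaged

block = guarded_alloc(16)
user_start = len(GUARD)

# Simulate a library that writes 18 bytes into its 16-byte structure.
for offset in range(18):
    block[user_start + offset] = 0xFF

print(check_guards(block))  # ['after']
```

A damaged trailing guard points at a write just past the end of the structure, which is also why padding each allocation was enough to hide the problem.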

~~~
davidw
Ooh, those kinds of bugs are evil. I once had a bug in Rivet caused by Apache
and Tcl linking to different versions of the same struct, getting confused,
and stomping on some memory. It was really a bitch to track down :-/

------
jrockway
This is pretty common. I used to name my submit buttons "submit", but when I
started using Javascript to do the submission, I found that form.submit()
didn't work everywhere. The reason is that form fields are accessed by
form.field_name, and so the field data replaces the function stored in the
"submit" slot. _sigh_.

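A loose Python analogy for the shadowing described above, not browser code: the `Form` class and `add_field` are hypothetical stand-ins for a form object that exposes its fields as attributes. In a browser, the usual ways out are to rename the button or to call `HTMLFormElement.prototype.submit.call(form)`.

```python
class Form:
    # Stand-in for a form object that exposes its fields as attributes.
    def __init__(self):
        self.sent = False

    def submit(self):
        self.sent = True

    def add_field(self, name, value):
        # Mirrors the "form.field_name" convenience: each field becomes
        # an attribute, even if that hides a method of the same name.
        setattr(self, name, value)

form = Form()
form.add_field("submit", "Send")    # a button named "submit"

try:
    form.submit()                   # now resolves to the string "Send"
except TypeError as err:
    print("call failed:", type(err).__name__)

Form.submit(form)                   # reach the method through the class
print(form.sent)                    # True
```
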
------
davidw
I love doing quick hacks to improve stuff. Sometimes it's even more fun if
it's a system I'm not familiar with. Not really a 'bug', but I made Linux boot
faster off of USB pen drive type devices by adding a wait queue:

http://www.welton.it/freesoftware/patches/blkdev_wakeup-2.6.18.patch

I don't know if it ever made it into the official kernel, as I got tired of
prodding those guys to either accept it or reject it (and in the meantime they
put in some lame hack telling the kernel to wait N seconds before proceeding,
in the hope that that was long enough to mount the device), but a number of
people have thanked me for it over the years.

BTW, I'm in awe of people who spend all their time hacking kernel stuff. It's
hard work, and for many things, you risk crashing your whole computer if
things go wrong.

------
rtf
On the game project I'm working on (as a designer) we use an internally
developed scripting language that was originally meant for simple timelined
cinematics. It was, according to the author, made in one day.

Each line gets a time in seconds followed by a call on some entity: hello
world is "0.0 Script:Print("hello world")". The syntax quickly gets verbose
from there - and no multi-line statements allowed!

Some of the flaws of this language:

-Miserable computational abilities. Until just this morning we were unable to add two numbers together. The main limitation is poor access to information - getting things like the x position of an entity requires a language extension - an easy one, but when it's not there....

-A vague notion of types. Variables attached on entities are typically enclosed in quotes: "true", "100", etc. But in arbitrary instances you can also use space-delimited vectors, for example: Script:Goto(100) or Entity.SimpleMovableInst:SetPosition(0 0 0). If you use the wrong thing, you will be lucky to get an error message. Usually it just silently fails.

-Lexical macros. These were added in order to cut down on the number of custom scripts we were using: instead of cut-and-pasting "switch1...door2....etc." variables we could say "$[target]" and it would be replaced at runtime with the contents of variable "target" on that entity instance. But they have the effect of giving a lot of false errors in the cases where we don't need to use those variables on a rich, multi-purpose entity. Worse, if I forget to add a variable necessary for the script, that instance will fail and I may not realize why.

-The parser doesn't recognize delimitation of strings, among other things. Putting a colon inside a print statement causes an error message.

It is no surprise that endless grief has resulted from these various flaws. I
fixed bugs resulting from all of the above just today.

It works fine for timelines, but I'm hoping to get Scheme (or something with
similar potency) for the next project.

~~~
LogicHoleFlaw
Lua is designed for exactly the type of work you describe here. I wrote a Lua
interface for The Nebula Device (Open-sourced 3d game engine from Radon Labs.)
It took about a weekend, starting from scratch. Interfacing to C or C++ code
is dead simple. And the language itself draws a lot of inspiration from Scheme
without abandoning a more familiar syntax. You get first-class functions,
closures, lexical scoping, coroutines, and a clever "metamethod" system for
defining object behavior. It's an extremely quick and efficient implementation
written in strict ANSI C, and it works well in constrained environments like
games. It's quite popular in both console and PC games for higher level
functions on top of a low level engine. Give Lua a look-see!

~~~
rtf
I integrated Lua once for a project of my own. But on this project the choice
wasn't mine - we ported over existing technology to a new console to meet a
five-month ship deadline.

Also, Lua is imperfect for games: Squirrel is considerably more attractive
because it is designed to work in real-time situations (no GC interference). I
also had issues with Lua's error-handling.

------
tlrobinson
Debugging VHDL on an FPGA (which ran correctly in a simulator) is, uh, fun. I
don't remember the details, but it was a single _character_ fix.

Also, debugging MS Office file format generating code is painful when the only
feedback you have is Office giving you "corrupted file" errors.

