Not that a debugger helped that much after they managed to get a fairly reliable but convoluted repro. It turned out that really stressing the scripting system with certain patterns would cause the crash to happen much more frequently, so I could see it in the debugger. It didn't help all that much: it was an apparently random memory stomp, and by the time it crashed, it was much too late to tell where it came from. I forget how I figured this out, but I eventually managed to narrow the cause down to garbage collection runs. Now, GC runs were periodic, but consoles place hard limits on memory. You can't just swap to disk when the going gets tough, so we had memory budgets for each game component, including the scripting system. So if scripts got particularly greedy, they'd run out of memory before the next GC run.
Now, as the memory limits were hard, some clever sod had put a GC call in the Lua malloc hook (the function that supplied Lua's memory), set to run whenever there was no memory available and the game would otherwise have crashed - no doubt in order to fix an earlier bug. Most of our scripts didn't create hash tables, arrays, and strings frequently, so this bug hadn't been a big enough problem to get noticed for what must have been years. In Lua, those types of objects require two allocations, one for the base object and one for the data storage. You can see where this is going.
If Lua ran out of memory halfway through creating a hash table, array, or string, that is, after successfully creating the base object but failing on the data store, it would trigger a GC run. Thankfully this was actually not that hard to hit, as the data store was generally way bigger than the 16 or so bytes used for primitive types (e.g. base objects, numbers, ...), so the probability of not having enough contiguous space was much higher than that of not having a 16-byte slot. In any case, the hash table (etc.) constructor had of course not returned yet, so there were no references to the hash table object yet, and it promptly got collected. The freed memory was then initialised as a hash table anyway and returned from the constructor, and it was just a matter of time until another allocation wrote straight over it. Not just any allocation of course, as re-allocating it as a (legal) primitive type wouldn't have caused a crash.
The fix was of course easy once the cause was known: don't put the base object in the allocated list for GC consideration until the whole object has been assembled.
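For illustration, here is a toy Python model of the failure and the fix described above. It is only a sketch and not Lua's actual implementation: `ToyHeap`, `new_table`, the 16-unit base object and the budget of 100 units are all made up for the example, and "freeing" is just a flag plus some bookkeeping.

```python
class ToyHeap:
    """Toy budgeted heap with an emergency GC in the allocation hook (hypothetical model)."""

    def __init__(self, budget, safe_construction):
        self.budget = budget                    # hard memory limit, arbitrary units
        self.used = 0
        self.tracked = []                       # the "allocated list" the collector sweeps
        self.roots = []                         # objects the scripts still reference
        self.safe_construction = safe_construction

    def alloc(self, size):
        # The "malloc hook": if the budget is exhausted, collect and retry once.
        if self.used + size > self.budget:
            self.collect()
        if self.used + size > self.budget:
            raise MemoryError("script memory budget exhausted")
        self.used += size

    def collect(self):
        # Free every tracked object that no root refers to.
        survivors = []
        for obj in self.tracked:
            if any(obj is root for root in self.roots):
                survivors.append(obj)
            else:
                obj["freed"] = True
                self.used -= obj["base"] + obj["data"]
        self.tracked = survivors

    def new_table(self, data_size):
        # Two allocations: the small base object, then the data store.
        obj = {"base": 16, "data": 0, "freed": False}
        self.alloc(obj["base"])
        if not self.safe_construction:
            self.tracked.append(obj)            # bug: collector can see a half-built object
        self.alloc(data_size)                   # may trigger the emergency GC
        obj["data"] = data_size
        if self.safe_construction:
            self.tracked.append(obj)            # fix: track only once fully assembled
        return obj


def demo(safe_construction):
    heap = ToyHeap(budget=100, safe_construction=safe_construction)
    heap.new_table(60)                          # unreferenced garbage, 76 units in use
    table = heap.new_table(60)                  # base fits, data store does not
    heap.roots.append(table)                    # the caller only gets a reference now
    return table["freed"]


print(demo(False))  # True: the constructor handed back an already-collected object
print(demo(True))   # False: the half-built object was invisible to the collector
```

In the buggy mode the second `new_table` call gets its base object swept by the GC run that its own data-store allocation triggered, which is the dangling object that a later allocation would then overwrite.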
Took me days. And I wasn't even the first person to be assigned the bug; it was one of those hot potatoes that went round all the senior people until it landed on the junior tech programmer's list (mine).
I looked in the checkin history for the malloc hook, and they had shipped at least one game with that bug in (records didn't go back far enough to rule out the game before). If you figure out what scripts to trigger repeatedly, you can make that game crash.
I can't really blame just one person for this. Putting the GC call in the malloc hook was thoughtless, but maybe I would have done the same without checking that it was safe. As for Lua itself, that was a pretty careless way to handle object creation given that the malloc hook is user-defined, so Lua has no control over what goes on in there.
More bedtime war stories another time.