Java app, deployed to four servers (by rsyncing a zip file and unzipping it). One of the four fails on startup, in some way that's hard to trace (I think it might have been a JNI crash?). But all four servers have the same OS, same version, same packages installed, same JVM, and it was the same zip file. Check the md5sum on the zip: it matches. In desperation one of my colleagues writes a script to recursively go through the unpacked version of the app and check the md5sums of all the files. Still matches perfectly, and the same files are present on all machines.
We get the dev team to try the app - there's a bit more variety in our devboxes than servers. Two of them can reproduce the failure, but there's no obvious correlation - one's java 1.5, one's java 1.6. One's Debian, one's Gentoo. For every combination there's another developer with a similar machine where it works fine.
Turns out that one server had been installed with a different filesystem from the other three (reiserfs?), which meant that the directory entries for the files were in a different order. The JVM just lists all the classes in the directory and then loads them in on-disc order, so classes were getting initialized in a different order, and it was that that was ultimately triggering the bug.
The solution is to eliminate any duplicate classes from the classpath and to enforce this in your build process. We use this Maven plugin to do it: https://github.com/ning/maven-duplicate-finder-plugin
Paying attention to the system you'll land on is something that can get lost on even senior developers. And, therein: details, details, details...
Another reason to have a strong ops (or devops) team: Providing/enforcing proper, intended, and thoughtful context and runtime.
P.S. As I now recall, there was also a use case where the locale was meant to deliberately vary, although those instances should have remained orthogonal to everything else. Nonetheless, one more possible driver of a mistake that would enable this weakness to occur.
I have to disagree, actually. This kind of problem is exactly where you really benefit from having full-stack people who understand both sides of the system; it would have been very hard for an ops person who didn't know about Java's quirks or a pure dev who didn't know about the unix filesystem to diagnose.
(Such a perspective is also how I determined the problem in the first place -- which actually resulted from a botched fix to a prior problem that I'd identified.)
However, despite the variance -- or risk of same -- that I described, in general we had some very capable and dedicated devops people who put a lot of effort and care into our environments. I get rather uncomfortable considering how things would have been had that not been the case.
It's been a while, and maybe I'm mixing my stories a bit. But I left that role and product mix -- which had very significant security requirements and ramifications -- quite impressed with the role those devops folks played in keeping us safe.
Perhaps what I meant by "strong" goes somewhat in the direction of your description of "full stack". Our senior devops people tended to trend in that direction.
And we didn't have "turf wars". Instead, devops was a partner often throughout the development lifecycle. It helped make sure that the final destination was appropriately, safely, and consistently configured. (My "locale" situation aside; and in such an instance, the setting would subsequently receive heightened and sustained scrutiny, putting a curb on unintended variance of the setting as well as fixing the code that such variance would impact.)
Strong in knowledge and ability, as opposed to simply or foremost in an authority to dictate.
It usually provides them with an opportunity to talk technically about something they know and helps me understand how well they communicate problems and solutions. Plus, it's sometimes fun to hear the stories.
My own personal worst bug (where by worst I mean "had the worst impact") was when we disabled a large chunk of the southern Beijing cell phone system for a short time during the night whilst deploying a field test of new base station hardware. That was a stressful rollout.
It turned out that the problem came down to a single dialog box, when people tried to cancel a process on the page.
The resulting dialog said:
"Do you want to Cancel?"
"OK" or "Cancel"
Rewording the options fixed the issue...
Are you sure you are getting honest answers? Some candidates may be thinking "He'll never hire me if I admit how stupid I was, so I'll use this secondhand or dumbed-down story instead."
It's usually fairly straightforward to see if somebody talks about their own experiences or is retelling a story they simply heard.
Unfortunately I've had a couple of candidates whose response to the question was "I don't really write bugs." Those are the ones I know for sure are lying.
Memory corruption in a video game that I was developing in the 1990s. It took 2-3 days of running attract mode to trigger it, whereupon the game would crash catastrophically.
Solution: Videotaped attract mode for 2 days until it happened. Then I single frame advanced through the 4 frames between its first manifestation and the complete crash of the program. 15 minutes later I knew exactly what was going on and fixed it shortly thereafter.
These days, any bug that survives my best efforts for more than a day usually ends up being a HW/driver issue in equal measure. I've learned a lot since then.
I spent three days in coffee shops with a pen and paper taking first and then second derivatives of the Winkel Tripel formula.
I also spent several days churning formulae in bars. Got a lot of weird looks.
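For anyone curious what was being differentiated: the Winkel tripel (as given in standard references; φ₁ is the standard parallel, with Winkel's own choice being φ₁ = arccos(2/π)) averages the equirectangular and Aitoff projections:

```latex
\alpha = \arccos\!\left(\cos\varphi \,\cos\tfrac{\lambda}{2}\right), \qquad
x = \frac{1}{2}\left(\lambda\cos\varphi_1 + \frac{2\cos\varphi\,\sin(\lambda/2)}{\operatorname{sinc}\alpha}\right), \qquad
y = \frac{1}{2}\left(\varphi + \frac{\sin\varphi}{\operatorname{sinc}\alpha}\right)
```

with the unnormalized sinc, sinc(α) = sin(α)/α (taken as 1 at α = 0). Chasing α(λ, φ) through two derivatives by hand really is a three-days-with-pen-and-paper job.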
Turns out that HP was also shipping a build of Kerberos/GSS libraries that actually relied on this behavior to function properly, and since our own project linking in these libraries did the sane thing and enabled the faults, the Kerberos code would crash.
I won't forget debugging through the library assembly code, wondering how the hell it ever worked. Luckily my education had included a good survey of the computer architecture zoo, where surely someone once upon a time had thought it was a great idea to spec a zero page at low memory addresses.
And of course today I know enough to never believe that a platform must segfault on a null pointer dereference. C undefined behavior works in mysterious ways.
What makes articles like this impressive is that they involve a problem at a different level of abstraction. For example, what if your model didn't work because multiplication broke for specific inputs?
That said, every time I read one of these I'm reminded of a legendary bug report, or at least an old one, because I think I remember reading it in a Usenet newsgroup posting. It involved some malfunctioning hardware, and the cause was related to the floor panels and the proximity of people walking around. I can't find it to store in my bookmarks. Someone here must know what I'm referring to, yet I can't summon it on Google.
Proper hardware development labs have antistatic carpet etc., but in embedded software development, one often doesn't have this luxury and has to be aware.
Overly enthusiastic video demonstrating the issue for those who are curious. :)
On Symbian OS, the window manager managed all the screen drawing. All visible apps would be asked to send draw ops to the WM and it would draw them clipped to the apps' windows.
And at UIQ we were adding theme wallpapers and memory-hungry graphics faster than our licensees were adding RAM.
And a real problem was running out of RAM drawing the screen. Doing rectangle intersections actually requires allocation, so drawing isn't constant memory.
To speed up drawing we made the WM retain the draw ops. This was transparent to the apps, but a massive performance win. We made a 'transition engine' to smoothly slide between windows and smooth scroll windows and things, at a time when some Nokians confidently told me it wasn't possible :)
But what if our cleverness caused an out-of-memory in the WM? I had a cunning plan...
We intercepted malloc and, if it failed, called out to a memory manager app to start zapping things. And if a second allocation attempt failed, we started discarding draw-op buffers and unloading theme assets.
And this seemingly worked! By making our graphics adapt dynamically to RAM usage rather than ring fencing we got much better app switching because background apps weren't getting unloaded.
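The reclaim-and-retry scheme described above can be sketched roughly like this (the hook names are illustrative, not the actual Symbian/UIQ API):

```c
#include <stdlib.h>

/* Illustrative reclaim hooks, not the real UIQ API: each returns
   nonzero if it managed to free anything. Stubbed for the sketch. */
static int zap_background_apps(void)  { return 1; }
static int discard_draw_buffers(void) { return 1; }

/* Window-manager allocation wrapper: on failure, reclaim memory in
   increasingly drastic steps and retry the allocation.             */
void *wm_alloc(size_t n)
{
    void *p = malloc(n);
    if (p || !zap_background_apps())
        return p;                 /* success, or nothing left to zap */
    p = malloc(n);                /* retry after zapping apps        */
    if (p || !discard_draw_buffers())
        return p;
    return malloc(n);             /* last try: draw ops are gone     */
}
```

Nothing here is wrong in isolation; the danger lies entirely in what the reclaim hooks are allowed to free out from under the caller.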
And then, just as the first phone with this tech was being tested (manually, by a small army), it would sometimes crash with meaningless stacks.
My team jumped into the challenge thinking we were elite clever sods and all bugs were shallow.
After a few days, I had to start making excuses; we were stumped. The thousand-monkey tests showed no pattern, only that it happened often. Where was the crash coming from?
A lunchtime walk cleared our heads a bit, and suddenly the horrid realisation was before us: if the allocation that failed was a bitmap data block, the bitmap itself might be reaped, but the caller's stack would resume initialising RAM that malloc no longer thought it owned. In the end some other random bit of data would be interpreted as a memory address, and eventually the WM would blit to it...
The phone never shipped because the plug was pulled on UIQ, but I think this bug was fixed and forgotten before then.
With some custom embedded electronics, the ADC would work fine in a lab setting but always throw out garbage in the field. Yet about 500 of these units had been running fine in the field since February. It turned out there was a bug in the FPGA that controlled the ADC: at somewhere around 80 degrees Fahrenheit, there was enough propagation delay that the ADC wouldn't start up correctly. Since the other units were started in February when it was 25 degrees, and only a few had been restarted since, it hadn't been noticed.
That was frustrating.
Another fun one was the rapid degradation of a database when the write-back cache battery failed on the RAID controller serving the write-ahead-logging disk, and nobody was notified.
Right now I've been battling a random-corruption NFS bug for a few weeks. I recently thought it was the automounter, but the bug has since appeared on a few other nodes.
War story #1:
I designed and programmed a board based on a TI fixed-point DSP (5x series). The problem was that the software ran for a very short while and then the board would crash. I went through everything I could think of: the software, the reset sequence, memory accesses. Everything looked good. Called TI support. Couldn't figure it out. After I think two weeks of checking everything, it turned out that one of the ground pins that was supposed to be connected (it was in my schematic) was left unconnected by the PCB designer. When we brought up the PCB design we saw the via to the ground plane, but a tiny little segment between the pad and the via (under the chip) was left unconnected. If you don't hook up all the Vcc and Gnd pins, you get undefined behaviour...
War story #2: Odd intermittent very rare behaviour in an application we worked on. Turned out we were using some implementation of shared pointers that used interlocked increment for incrementing the count but didn't use interlocked decrement for decrementing. So very rarely two threads on two cores would hit that and someone would end up with an invalid pointer. That one also took a long time with trying to get some semi-reproducible behaviour to even know where to start looking.
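That failure mode is easy to reproduce in miniature: a reference count whose increment is atomic but whose decrement is a plain read-modify-write. A sketch of the fixed version in C11 atomics (the real code was a C++ shared-pointer implementation):

```c
#include <stdatomic.h>

static atomic_int refcount = 1;

void retain(void) { atomic_fetch_add(&refcount, 1); }

/* The buggy version effectively did a plain `refcount--` (load,
   subtract, store): two cores could read the same value, so one
   release was lost and the object was freed while another pointer
   still referenced it.                                            */
int release(void)
{
    /* returns 1 only when this call dropped the last reference */
    return atomic_fetch_sub(&refcount, 1) == 1;
}
```

The window is a handful of instructions wide, which is why it took tens of millions of operations across two cores to show up.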
EDIT: One thing I've learnt over the years is that bugs that look impossible to figure out eventually will be figured out. The magic time period is around two weeks for those rare super-hard bugs. Starting from no clue what's going on, with intermittent weird failures that look impossible to pin down, all you have to do is "do the time" and you can figure them out. I've seen people simply give up and live with things not working, believing those issues are "unsolvable"...
The guilty commit seemed fairly innocent, but it prompted me to try running `(loop (thread (const null)))`, which immediately segfaulted. `(loop (thread (thunk null)))` didn't. At this point we handed off to the racket devs, and replaced our `(const null)` callbacks with `(thunk null)`. After a few days they worked out what was going on and fixed it.
It was a simple for loop, and as I remember it took two weeks to spot the mistake: a missing `= 0` in
for (int c; c < x; c++)
The debugger initialized all memory to zero so it never failed when debugging :)
int A, B, C;
and it was being set something like this:
int* pA = &A; // pointer to contents of "A"
pA[2] = 3; // WTF?
A, B, and C were assumed to be contiguous in memory, and the code was treating "C" like the 3rd item in an array starting at "A".
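Three separate locals have no guaranteed layout, so the well-defined version of that trick needs an actual array. A minimal sketch:

```c
/* `int A, B, C;` gives the compiler freedom to place (or even
   register-allocate) each variable independently, so &A + 2 is not
   guaranteed to be &C. An array is the only layout the language
   guarantees to be contiguous.                                     */
int third_element(void)
{
    int v[3] = {0, 0, 0};
    int *pV = v;
    pV[2] = 3;          /* well-defined: writes v[2] */
    return v[2];
}
```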
After many, many fruitless debug sessions the problem turned out to be that the structure packing was different between two different compilation units. In some compilation units a particular structure was 56 bytes, in others 48 (or something like that). This was Bad.
There was an unterminated pragma-pack which was included in some compilation units but not others. In 32-bit mode it didn't cause any problems, because the structures were optimally packed anyway, but in 64-bit mode, when pointers were 8 bytes, the structures packed differently when the unterminated pragma-pack was included in the header before them.
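The effect is easy to demonstrate. On a typical 64-bit target (8-byte pointers, assumed here), the same struct definition changes size depending on whether an unpopped pragma is still in effect; this is a compressed sketch of what was spread across several headers:

```c
/* What the bad header effectively did: push packing and never pop. */
#pragma pack(push, 1)
struct Record1 { char tag; void *ptr; };   /* packed: 1 + 8 = 9 bytes */
#pragma pack(pop)   /* <-- the real header was missing this line      */

/* Any struct defined after including the bad header got the packed
   layout; the same struct in a TU without that include did not:      */
struct Record2 { char tag; void *ptr; };   /* natural: padded to 16   */
```

Two translation units then disagreed on the size and field offsets of the "same" struct, and every pointer passed between them silently corrupted memory.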
They had a packet sniffer that we had built for them, so we went to trying to diagnose the issue. I'd send them new versions of the programmer tool, they'd flash the room (which required pinging each and every device in the room and tapping a physical button on it, in one of the largest single-room installations of this brand of mesh network in the world), and they'd send me the logs of the sniffed packets. We could see that the packets for any device you happened to be standing next to would be completely fine, but if you turned your back on it and walked across the room, it started acting up. But they would be fine again if you walked back to it. "It's like it knows you're watching it", said the guy on the phone.
They kept insisting that I had let a virus into their network. Never mind that it was only possible to rewrite the configuration over the air, while rewriting the ROM required physical access to the board in the case; clearly it was my fault for all the "unnecessary fiddling" I had been doing recently (i.e. a slew of bug fixes they had requested, all involving my predecessor's lack of understanding of the pitfalls of threading and UI in .NET).
I kept telling the client that all the symptoms suggested radio interference from an outside source. They insisted they had never heard of such a thing, ever, in any context, including static on their car's radio. I being "merely" an applications developer and not an electrical engineer, they lacked faith in my explanation and insisted that I "just fix it". How I was supposed to be knowledgeable enough to fix it if I was apparently not knowledgeable to understand what was going on with it was beyond me, but whatever.
"Put it back to what it was before you started f*ing around with it." Revert the code through source control (thank God I had installed SVN when I first arrived at that company, because apparently EEs don't understand that it's not a good idea to keep dated copies of code directories around as "backups"). "This isn't working, I said give me the old version." Send them links to the installer on their own server. "You must have changed it on our server! How did you get access to our servers? This isn't working!"
I finally gave up around midnight and drove the 3 hours to my parents' house for Thanksgiving the next day. I nearly got fired for it.
We shipped them a radio spectrum analyzer and determined for sure that it was radio interference. The hotel opened the room next door and found a baby monitor, still on, fallen behind the dresser. They turned it off and the room responded flawlessly.
I should have quit then, but I needed the money, and I was going through some depression issues so I really thought it was my fault. I eventually did get fired from that place, the only place I've ever been fired from, for "not working enough overtime": I was only doing 50 hours a week when the intern fresh out of college needed 60 to get his much simpler tasks done (often leaving me blocked because of it, but I wasn't allowed to help him with anything because "you're not an electrical engineer", where apparently only electrical engineers know how to code in C). I don't regret it; it was the biggest piece-of-shit place I've ever been, and just the motivation I needed to get off my ass and finally change my relationship to work. I've been freelancing ever since.
I guess that's not so much "my hardest bug", but I did actually fix a bunch of bugs in the process of trying to convince them it was interference and not some mythical radio virus that could corrupt packets in mid-air. And all of it based on phone calls and emails with hex dumps of sniffed radio packets, while being "nothing more" than a lowly applications programmer.
That was a fun 2 days of troubleshooting, I think I 'solved' the problem 3 times before it was actually solved cross-browser.
a) I used to work on deep packet inspection software for a multicore network processor. It was kind of C, but with restricted APIs and some unique multicore concepts. Among them: the same binary ran on multiple cores to process packets, but with no hardware locks, because there was an implicit tag - a kind of hash computed on the 5-tuple (src/dst IP, ports, protocol) - to ensure only one core gets the packets from any one session / 5-tuple.
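The per-flow pinning can be pictured with a toy hash; FNV-1a is used here purely for illustration, as the real hardware's tag algorithm isn't something I know:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
} five_tuple;

/* FNV-1a over each field individually (never over the whole struct,
   which would also hash indeterminate padding bytes).               */
static uint32_t fnv1a(uint32_t h, const void *data, size_t n)
{
    const uint8_t *p = data;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

/* Same flow -> same tag -> same core, so per-flow state needs no
   locks... unless something switches the tag mid-pipeline.          */
unsigned core_for_flow(const five_tuple *t, unsigned ncores)
{
    uint32_t h = 2166136261u;              /* FNV offset basis */
    h = fnv1a(h, &t->src_ip,   sizeof t->src_ip);
    h = fnv1a(h, &t->dst_ip,   sizeof t->dst_ip);
    h = fnv1a(h, &t->src_port, sizeof t->src_port);
    h = fnv1a(h, &t->dst_port, sizeof t->dst_port);
    h = fnv1a(h, &t->proto,    sizeof t->proto);
    return h % ncores;
}
```

The whole lock-free design rests on that determinism, which is why an undocumented tag switch inside an API call was so destructive.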
So the scenario: a protocol parser whose job was to parse some other info along with the IP and call an external API to add a subscriber. When this parser was run for 10-15 minutes on a live setup, it would seg fault after processing some 60-70 million packets. The behaviour was reproducible, but it didn't occur at the same time or in the same piece of code.
Narrowing down didn't exactly work, since the crash stopped occurring when either the subscriber-addition API call or the parser was commented out. Yet each worked perfectly on its own.
Finally, after a couple weeks of long debug cycles and notes, it turned out to be AN IMPLICIT tag switch inside the subscriber-addition API. Since we were not locking through APIs, the tag switch could lead to the same packet being sent to multiple cores, and anywhere down the line in the follow-up code, a (now redundant) allocation, a shared-memory access, or a free could turn into a seg fault.
Now, the implicit tag switch in the subscriber API was a documented and needed feature of the hardware. It just should have been DOCUMENTED in BOLD on the API itself, which was not the case.
b) In the same DPI product, we once added two fields to look for in the incoming traffic which should not have matched but were still matching in results. The unique thing was, they would only fail when used together and would work fine independently.
Going deeper into the code showed a strncpy that was intended as a safety measure against strcpy, but called with MAX_STRING_SIZE. So when the actual string was much shorter, strncpy would wipe the rest of the buffer with padded zeros, thereby overwriting the fields appended earlier. The author seemed to have missed the following note in strncpy's definition:
"If the end of the source C string (which is signaled by a null-character) is found before num characters have been copied, destination is padded with zeros until a total of num characters have been written to it."
Since then, I have been really careful when choosing strncpy over strcpy, which is so often advised without this caveat.
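The padding behaviour quoted above is easy to trip over. A minimal demonstration:

```c
#include <string.h>

/* Fills a buffer with pretend "appended data", then strncpy's a
   short string into the front with the full buffer size as n,
   and reports whether the data at the tail survived.            */
int tail_survives(void)
{
    char buf[16];
    memset(buf, 'X', sizeof buf);     /* pretend: appended fields  */
    strncpy(buf, "hi", sizeof buf);   /* copies 3 bytes, then pads
                                         the remaining 13 with '\0' */
    return buf[15] == 'X';            /* 0: the tail was zeroed    */
}
```

memcpy with an explicit length, or snprintf, writes only what was asked for and leaves the rest of the buffer alone.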