Hacker News new | past | comments | ask | show | jobs | submit login

> Unless it's one of those cursed things installed at the customer thousands of miles away that never happens back in the lab.

> Once you can actually reproduce the issue, you've often done 80-99+% of the work already.

War story...

I spent over a year on-and-off between adjacent sprints circa peak COVID chasing an elusive intermittent bug that kept coming back after failing formal acceptance testing performed at a designated off-site location. Despite requesting that I be flown to this location to personally replicate the issue within days of the initial report (unable to replicate in local lab and familiar enough with the system's design to intuit the superficially unbounded depth of the rabbit hole) and every subsequent failed test report thereafter, PM persistently decided to allocate funds elsewhere and just kept kicking this product to the back of the pipeline.

Nearing project end as the last backlogged issue in queue, I managed to isolate the root cause to a race condition in a firmware ISR of an embedded microcontroller buried 3 levels deep in mixed-signal hardware. This subroutine's execution path interacted with an async signal from an instrument that was occasionally ~10 ms slower between bursts than any of the references we had in our lab...which really shouldn't have mattered because the entire flow control setup was designed to be timing insensitive and executed by non-realtime host...except this one buried onion subroutine that really wasn't.

Attaining local replication was an imperative, which required some creative speculation and a kludge hardware mock, but once I got to a state of repeatably mirroring the failure mode at will, the problem was effectively 95% solved; what followed was a quick subroutine rewrite, anti-pattern scrubdown review, host software update to account for new payload, and instructing off-site techs to field reprogram before the next scheduled test event. I'd never been happier to tag an issue resolved and finally merge an all but abandoned WIP branch.

To this day, I still don't know with certainty what caused that catalyst instrument to occasionally perform ~10 ms slower relative to our references. In the end, we indirectly absorbed at least 50x the notional cost of that site visit when the dust finally settled.






The deeper you get debugging, the more you have to reverse engineer, modify, even outright hack the target. Awesome war story, I know how good that must have felt.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: