Bug resolution time depends on how familiar a developer is with the system, how complex the issue is and how impactful the bug is. Not everything can be solved in 24 hours. Not everything has to be solved in 24 hours.
Saying that your developers will solve every problem in 24 hours seems like a toxic PR move.
Yes, that was my reaction as well. I'm still traumatized by an insidious bug in a distributed system that took me around 3 months of nearly exclusive work to diagnose and fix (one-line fix, of course). ENG-168. Never forget.
At a previous company, we were in the early stages of building a massively distributed simulation platform that would power MMOs and government/military simulations. The platform was written in Scala and used Akka extensively (because of reasons). We had a test environment that spun up a decently big game world, and had a bunch of bots run around and do things. It would run overnight.
At some point it was discovered that every once in a while, bots that were supposed to just go back and forth across the entire game world forever would get stuck. It was immediately obvious that they were getting stuck at machine boundaries (the big game world was split into a grid, and different machines would run the simulation for different parts of the grid). This suggested the bug was in the very non-trivial code that handled entity migration between machines.
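The grid setup described above can be sketched roughly like this. Everything here (cell size, machine names, the cell-to-machine assignment) is invented for illustration; the original platform's layout isn't specified beyond "a grid of machines":

```python
# Rough sketch (assumed numbers and names, not the original platform) of
# grid partitioning: the world is split into fixed-size cells, each cell
# is owned by one machine, and an entity crossing a cell boundary has to
# be migrated to the new owner.
CELL_SIZE = 100
MACHINES = ["sim-0", "sim-1", "sim-2", "sim-3"]

def cell(x, y):
    # Map a world position to its grid cell.
    return (int(x // CELL_SIZE), int(y // CELL_SIZE))

def owner(x, y):
    # Deterministic cell -> machine assignment (here a simple modulo).
    cx, cy = cell(x, y)
    return MACHINES[(cx + cy) % len(MACHINES)]

def needs_migration(old_pos, new_pos):
    # Migration is triggered exactly when ownership changes.
    return owner(*old_pos) != owner(*new_pos)
```

A bot walking back and forth across the whole world crosses these ownership boundaries constantly, which is why "stuck at machine boundaries" immediately pointed at the migration code.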
This was a nightmare to debug. Distributed logging isn't fun. Bugs in distributed systems have a tendency to be heisenbugs. We could reproduce the bug more or less reliably, but sometimes it took hours of running the simulation until it manifested; worse, not manifesting for a few hours wasn't a clear signal that the bug had been fixed.
My investigations were broad and deep. I looked at the Kryo serialization protocols at the byte level. I scrutinized the Akka code we were using for messaging. I rewrote bits and pieces of the migration code in the hope it would fix the bug. Many other engineers also looked at all this and found nothing. A Principal Engineer became convinced this had to be a bug in Scala's implementation of Map. I was very close to giving up multiple times.
At some point there was a breakthrough -- another engineer discovered a workaround. A violent but effective one: flushing every cache and other bits of internal state except the ground truth would get the entities unstuck. We added a button to the debug world viewer appropriately labelled YOLO RESYNC. We were so desperate about this bug, we seriously discussed triggering a YOLO RESYNC periodically.
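The workaround amounts to the classic "throw away everything derived and rebuild from ground truth" move. A minimal sketch, with invented names and structure (nothing here is from the actual codebase):

```python
# Minimal sketch (invented structure) of a "YOLO RESYNC": every piece of
# derived state is discarded and rebuilt from the authoritative store.
class RegionState:
    def __init__(self, ground_truth):
        self.ground_truth = ground_truth  # authoritative: entity -> cell
        self.by_cell = {}                 # derived index: cell -> entities
        self.subscriber_cache = {}        # derived pubsub routing state
        self.rebuild()

    def rebuild(self):
        # Reindex entities purely from ground truth.
        self.by_cell = {}
        for entity, cell in self.ground_truth.items():
            self.by_cell.setdefault(cell, set()).add(entity)

    def yolo_resync(self):
        # Flush every cache except the ground truth, then reindex.
        self.subscriber_cache.clear()
        self.rebuild()
```

The fact that this unstuck the bots was itself diagnostic: it implicated stale derived state (caches, indexes, routing tables) rather than the ground truth.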
But if YOLO RESYNC fixed the issue, it meant that there was some sort of problem with the state of the system. I spent some more days and weeks diffing the state before and after YOLO RESYNC (more difficult than it sounds in a not-entirely-deterministic distributed simulation) and narrowed it down more and more until I finally found a very subtle bug in our pubsub implementation. I don't remember exactly what the issue was, but there was some sort of optimization to prevent a message from being sent to a recipient under certain conditions that would "guarantee" the recipient would have gotten the message in some other way -- and the condition was very subtly buggy. Fixing it was a one- or two-line change.
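I don't have the original code, but the class of optimization described, skipping a direct send because the recipient is "guaranteed" to receive the message by another path, looks roughly like this. This is a correct version with an invented structure; the bug was in getting the suppression condition subtly wrong:

```python
# Illustrative sketch of a pubsub delivery-suppression optimization
# (invented structure, not the original code). A direct send is skipped
# only when the subscriber is certain to receive the message through the
# wildcard path; if this condition is even subtly wrong, messages are
# silently dropped and state drifts until something like a YOLO RESYNC.
class Broker:
    def __init__(self):
        self.by_topic = {}     # topic -> set of subscribers
        self.wildcard = set()  # subscribers to every topic

    def subscribe(self, who, topic=None):
        if topic is None:
            self.wildcard.add(who)
        else:
            self.by_topic.setdefault(topic, set()).add(who)

    def publish(self, topic, msg, deliver):
        for who in self.by_topic.get(topic, set()):
            if who in self.wildcard:
                continue  # suppression: the wildcard loop below covers them
            deliver(who, msg)
        for who in self.wildcard:
            deliver(who, msg)
```

The nasty property of this pattern is that a wrong condition doesn't crash anything; a recipient just quietly misses one message, and the inconsistency only becomes visible much later, far from the cause.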
I still remember the JIRA ticket: ENG-168. It tested my sanity and my resilience longer and harder than anything else before or after.
[EDIT] I saved this ticket as a PDF as a traumatic memory. It was in January 2015, so I got some details wrong, the main one being that it only took about two weeks from the bug report (Jan 28) to the fix (Feb 10). I swear it felt like 3 months.
Well, you could at least start with solving everything in 24 hours that can be solved in 24 hours. More often than not, such bugs take days, weeks, or months not because of time in the IDE, but because of backlogs, prioritization, time in test, and longer release cycles. Streamlining that sounds like mostly a win to me.
Note that a bug fixed in 24 hours is also a bug that doesn't have to be fixed later. I mean, the development work has to be done at some point anyway, and fixing it immediately may even save some time otherwise spent discussing and bouncing the issue around.
Speaking as one of those developers, we suggested the topic. We are proud of this!
I've worked at several companies you know (https://www.linkedin.com/in/macneale) - and this is the least toxic company I have ever worked at. Hands down. We take pride in running a tight ship.
Reminds me of my stint in software. We could easily do these numbers as well.
Until we outsourced code production to a country halfway round the globe. That completely changed the dynamics: from solving interesting problems in nicely written code, to chasing people you've never met and never will to fix bugs in 1000+ line functions, all while knowing it's an exercise in futility because no doubt there will be lots and lots more where that came from.
It's physically impossible to convince non-technical managers (or technical managers too far removed from the code) that this is always a one-way decision that cannot ever be remediated. Inevitably, the codebase will corrode and decay until its Git repo becomes a superfund site and has to be fenced off with razor wire festooned with warning signs to keep the newbies away.
"But they're cheaper!"
"Just fix the bugs!"
Etc...
It's. Just. Not. Possible to convince anyone that it's not like hiring a recent immigrant to clean your yard. Sure, their English might not be so good, but they can hold a broom, right? No big deal! Worst case, if they miss a spot, you can just point to it and they'll fix it.
The destruction of code quality is an invisible, insidious thing that can end billion-dollar companies while management whistles all the way to the bank collecting their bonus cheques for yet another awesome quarter.
Pressure to resolve issues hastily could lead to suboptimal, band-aid solutions rather than thoughtful, robust fixes that address root causes. The relentless pace may be unsustainable and lead to engineer burnout.
On top of all that, if not managed carefully, the 24-hour guarantee could perversely incentivize customers to classify all requests as "bugs" to get faster service, blurring the line with genuine feature requests.
Why optimize for this? Some of the worst jobs I’ve had were the ones where a PM would say “drop whatever sprint items you’re working on and immediately context switch to this bug”. Sounds like this is their regular day? And what happens if a bug comes in Friday @ 4pm?
It's implicit but clear from the article that they have thought about this. There's an owner responsible for triage and communication on incoming reports, and somebody responsible for bugfixing; presumably they have some roster where duty rotates after so many days. This limits the context switching to only one or two people at a time. It also sounds like they're trying to make everything work in office hours, so a bug that comes in Friday afternoon gets fixed on Monday.
From the article it sounds like a completely different game than what you mention, which I also experienced at times.
This is a function of priority: it is saying that existing features working perfectly are more important than new features.
It also should come with a culture of treating bugs like you do a production incident: You do retrospectives and figure out what you can change to your process to reduce the likelihood of the same type of bug happening again. You likely also have a more extensive suite of tests to prevent regressions.
Fair enough, especially if structural. I've always felt most comfortable with a work relationship where I don't mind doing some work in evenings or weekends, but where it is also very much ok to do some personal stuff during the weekdays. It's a win for all involved. But that has its limits of course.
It's not only unreasonable (who expects or needs this?), it's also unsustainable.
What happens if 5 bugs get filed on a Friday?
Might be good when you have few users in the beginning, but long term I don't see it happening.
At a certain point it'll hamper the ability to ship features quickly, as you go from having to think about the 80% of use cases to spending a significant chunk of time anticipating and fixing all kinds of very unlikely edge cases.
I guess it's preference, but I'd rather ship quickly and cover most of my users, then respond to any bugs that do surface; there's every chance the obscure issues won't really impact revenue anyway.
I mean, it's a way to set priorities. Even if it turns out not to be an ironclad guarantee, having taken that pledge means that when people end up taking too much time fixing bugs, there's already consensus that that's not acceptable, and it'll presumably be much easier to make the case for allocating resources to address it.
They are not writing code without bugs. But the rate of incoming bugs varies from project to project. For mature projects without a lot of churn the bug rate decreases over time.
They may also be in a position where they can afford to delay releasing a feature until all bugs are fixed. Most teams on the other hand ship with known bugs, because a feature with bugs today will make more money than a feature without bugs tomorrow.
Because the rest of the industry is managed by people who don't care about code, and puts people who became coders after a couple of Udemy videos in charge.
I read "DoIt" the entire time as I read the article from top to bottom. But when I browsed to their homepage, the font was different and I finally read "Dolt" with an L, instead of "DoIt" as in do-it. Oops.
One important element is to provide a workaround (info, instructions, or a small script) ASAP, before you start fixing the bug. This is especially important for all kinds of blocking issues.