> "We have to upgrade right now because everything is on fire!"
I don't know exactly how it relates to the broader "tech debt" discussion, but one thing that can make a big difference is to design systems in such a way that there are obvious strain indicators.
To quote Firefly, ways the system can "let you know she's hurting before she keels."
How exactly you do that depends on what you're building, but I think it's a worthwhile thing to keep in mind.
I can't think of a specific example off the top of my head, but I've noticed that lots of places talk about needing people to be "on call" all the time. We are a very small shop with some fairly high-profile clients, and "everything's on fire" situations happen maybe once every year or two. We then mitigate them by fixing the code so that there are diagnostics to let us know if we're heading into dangerous territory again, before everything goes kaboom.
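For what it's worth, here is a rough Python sketch of what a "strain indicator" could look like. Everything in it is made up for illustration (the `StrainGauge` class, the `job_queue_depth` name, the 10,000 limit, the 70% warning threshold); the only real idea is that you warn well below the point where things actually break.

```python
from dataclasses import dataclass


@dataclass
class StrainGauge:
    """Tracks one resource against a hard limit and warns early."""

    name: str
    hard_limit: float            # the point where things actually break
    warn_fraction: float = 0.7   # hypothetical: start complaining at 70%

    def check(self, current: float) -> str:
        """Return 'ok', 'strained', or 'critical' for the current reading."""
        if current >= self.hard_limit:
            return "critical"
        if current >= self.hard_limit * self.warn_fraction:
            return "strained"
        return "ok"


# Hypothetical resource: a job queue that falls over around 10,000 entries.
queue_depth = StrainGauge(name="job_queue_depth", hard_limit=10_000)

for reading in (1_200, 7_500, 10_400):
    print(queue_depth.name, reading, queue_depth.check(reading))
```

The "strained" state is the useful one: it is the system telling you she's hurting while there is still time to do something about it during office hours.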
With DevOps you monitor at least four metrics: lead time for changes, deployment frequency, mean time to recover/restore service, and change failure rate. These indicators show whether you're improving, stagnating, or straining. You can also track more specific service level indicators, the ratio of toil to project work, and the size of the tech debt backlog versus the feature backlog.
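As a minimal sketch of how those four numbers could be computed, assuming you can export a record per deployment: the `Deployment` fields, the `four_key_metrics` function, and the sample data below are all hypothetical, and real tooling would pull this from your CI/CD and incident systems.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a failure in production?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def four_key_metrics(deploys: List[Deployment], period_days: float) -> dict:
    """Compute the four metrics over one reporting period."""
    if not deploys:
        raise ValueError("need at least one deployment")
    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deploys
    ]
    failures = [d for d in deploys if d.failed]
    restore_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "lead_time_hours": mean(lead_hours),
        "deploys_per_day": len(deploys) / period_days,
        "change_failure_rate": len(failures) / len(deploys),
        # None when nothing failed (or nothing has been restored yet)
        "mean_time_to_restore_hours": mean(restore_hours) if restore_hours else None,
    }


# Made-up sample data: two deployments over two days, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
day = timedelta(days=1)
deploys = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(
        t0 + day,
        t0 + day + timedelta(hours=8),
        failed=True,
        restored_at=t0 + day + timedelta(hours=10),
    ),
]
metrics = four_key_metrics(deploys, period_days=2)
print(metrics)
```

The absolute values matter less than the trend: a lead time or change failure rate that creeps up period over period is the strain indicator.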
The "everything's on fire" metaphor also has a larger context. Sure, when people are getting woken up in the middle of the night because the site is down, shit's on fire. But also "we're constantly missing our deadlines" is shit's on fire, "our customers are not satisfied" is shit's on fire, "our overhead is way too high for what we're delivering" is shit's on fire. If you're out on the water and building a new boat because yours is on fire, it's a little late.
My experience echoes yours. We had two senior devs and one junior. We had one major event in three years: a long weekend where things were on fire, with code being updated every half hour. But afterwards we had so many indicators of when things were even approaching that level that the job got almost boringly simple.