> IME a very large number of impacting incidents aren't strictly tied to "a" code change, if any at all
Usually this implies there are bigger problems. If something keeps breaking without any change (config or code), it was likely always broken and just ignored.
So once a company has resolved most of the low-hanging fruit, it's the changes that break things.
I've seen places where everything is duct-taped together, but it *still* only breaks on code changes. Everyone learns to avoid stressing anything fragile.
See the other child reply upthread: lots of service-to-service interactions that look more like distributed state than a CR. And my view was across an org-wide scope where even "infrequent" quickly accumulated. AWS is on the order of 50,000 SDEs running 300 public services (plus multiples more internal), with each team/microservice having ~50 independent deployment targets.
At my place, 90% of them are third parties going down, and you can't do much other than leave. But the new third parties are just as bad. All you can do is gracefully handle failure.
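For what it's worth, "gracefully handle failure" usually boils down to a timeout/retry wrapper with a fallback, so a vendor outage degrades the feature instead of taking the page down. A minimal sketch (the `vendor` and `cache` names in the usage comment are hypothetical, not from any specific SDK):

```python
import logging
import time

log = logging.getLogger("thirdparty")

def call_with_fallback(call, fallback, retries=2, backoff=0.5):
    """Try a flaky third-party call a few times, then degrade gracefully."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception as exc:  # broad on purpose: vendor SDKs raise all sorts
            log.warning("third-party call failed (attempt %d): %s", attempt + 1, exc)
            if attempt < retries:
                time.sleep(backoff * 2 ** attempt)  # exponential backoff
    return fallback()  # serve stale/default data rather than erroring out

# Hypothetical usage: fall back to a cached value when the vendor is down.
# price = call_with_fallback(lambda: vendor.get_price("SKU-1"),
#                            lambda: cache.get("SKU-1"))
```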