> IME a very large number of impacting incidents aren't strictly tied to "a" code change, if any at all
Usually this implies there are bigger problems. If something keeps breaking without any change (config or code), it was likely always broken and just ignored.
So once a company has resolved most of the low-hanging fruit, it's the changes that break things.
I've seen places where everything is duct-taped together, but it *still* only breaks on code changes. Everyone learns to avoid stressing anything fragile.
See the other child reply upthread: lots of service-to-service interactions that look more like distributed state than a CR. And my view was across an org-wide scope where even "infrequent" quickly accumulated. AWS is on the order of 50,000 SDEs running 300 public services (plus multiples more internal), with each team/microservice having ~50 independent deployment targets.
At my place, 90% of them are third parties going down, and you can't do much other than leave. But the new third parties are just as bad. All you can do is gracefully handle failure.
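For what it's worth, "gracefully handle failure" usually boils down to a timeout/retry wrapper with a fallback, so a vendor outage degrades the feature instead of taking the page down. A minimal sketch (the `vendor` and `cache` names in the usage comment are hypothetical, not from any specific SDK):

```python
import logging
import time

log = logging.getLogger("thirdparty")

def call_with_fallback(call, fallback, retries=2, backoff=0.5):
    """Try a flaky third-party call a few times, then degrade gracefully."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception as exc:  # broad on purpose: vendor SDKs raise all sorts
            log.warning("third-party call failed (attempt %d): %s", attempt + 1, exc)
            if attempt < retries:
                time.sleep(backoff * 2 ** attempt)  # exponential backoff
    return fallback()  # serve stale/default data rather than erroring out

# Hypothetical usage: fall back to a cached value when the vendor is down.
# price = call_with_fallback(lambda: vendor.get_price("SKU-1"),
#                            lambda: cache.get("SKU-1"))
```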