Interesting. It sounds like “all” service state management (admin config, infra,...

Interesting. It sounds like “all” service state management (admin config, infra, topology) is discoverable/legible for meta. I think that contrasts with AWS where there is a strong DevTools org, but many services and integrations are more of an API centric service-to-service model with distributed state which is much harder to observe. Every cloud provider I know of also has a (externally opaque) division between “native” cloud-service-built-on-cloud-infra and (typically older) “foundational” services that are much closer to “bare metal” with their own bespoke provisioning and management. Ex EC2 has great visibility inside of their placement and launch flows, but itll never look like/interop with cfn & cloudtrail that ~280 other “native” services use.

Definitely agree that the bulk Of “impact” is back to changes introduced in the SDLC. Even for major incidents infrastructure is probably down to 10-20% of causes in a good org. My view in GP is probably skewed towards major incidents impairing multiple services/regions as well. While I worked on a handful of services it was mostly edge/infra side, and I focused the last few years specifically on major incident management.

Id still be curious about internal system state and faults due to issues like deadlocked workflows, incoherent state machines, and invalid state values. But maybe its simply not that prevalent.