I’m trying to understand the best way to improve the RCA (root cause analysis) process in a distributed system, both technically (microservices) and organizationally (DevOps, devs, analysts).
No matter where I’ve worked (as a DevOps engineer/SRE/backend dev), I’ve noticed a recurring problem: there are just too many moving pieces and not enough visibility to monitor them.
Detecting that there is a problem has become easier with tools like APM (Datadog) + exception management (Sentry) + logs (Kibana). But in my experience, even when you know something is broken, it’s hard to find out why. This becomes even more pronounced when working with multiple teams (dev, DevOps, infra, analysts) who use various systems & tools.
While debugging a problem, I find myself forced to open several tools (Kibana + Datadog + AWS + GitHub + Sentry + Slack) and use different techniques in order to pinpoint the real root cause.
I know GitHub should be the single source of truth, but in my experience that’s not the case for most problems. For example: infra changes made via the AWS console, manual schema changes, a recent deploy/rollback, cron runs we forgot about, an undocumented DB change, etc.
The RCA tends to lead me away from GitHub and into the dark corners of the system. To mitigate this pain, I’ve found several solutions that helped us. For example:
Important cron runs send their status (start/end) to Kibana
Asked people to post infra config changes in a dedicated Slack channel (but sometimes people just forget)
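To make the first mitigation concrete, here’s a minimal sketch of a cron wrapper that emits structured start/end events as JSON lines on stdout, assuming a log shipper (e.g. Filebeat) forwards them into Kibana. The job name, field names, and CLI shape are all hypothetical, not a description of our actual setup:

```python
import json
import subprocess
import sys
import time

def run_and_log(name, cmd):
    """Run a command and print start/end events as JSON lines.

    Hypothetical sketch: the JSON printed here is assumed to be picked
    up by a log shipper and indexed into Kibana, so an RCA can later
    answer "did that cron job run, when, and did it succeed?"
    """
    started = time.time()
    print(json.dumps({"job": name, "event": "start", "ts": started}))
    result = subprocess.run(cmd, capture_output=True, text=True)
    ended = time.time()
    print(json.dumps({
        "job": name,
        "event": "end",
        "ts": ended,
        "duration_s": round(ended - started, 3),
        "exit_code": result.returncode,
    }))
    return result.returncode

if __name__ == "__main__":
    # Demo: wrap a trivial command (replace with your real cron job).
    run_and_log("demo-job", [sys.executable, "-c", "print('hello')"])
```

In crontab you would then invoke the wrapper instead of the job itself, so every run leaves a searchable trace even if nobody remembers the job exists.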
Some questions that I find interesting:
Do you also feel this pain? Or is it just me?
What are the best practices for tracking all of these changes?
Did you implement any in-house solutions to solve this?
How do you reduce the time it takes to find the root cause?
Is it just me, or has Slack become a super important tool in the process of tracking changes?
I’d be happy to get any advice or feedback.