
Ask HN: Source of truth for Devops, because GitHub is just not enough - itielshwartz
Hi guys,<p>I’m trying to understand the best way to improve the RCA (root cause analysis) process in a distributed system, both technical (microservices) and organizational (DevOps, devs, analysts).<p>No matter where I worked (as DevOps/SRE/backend), I noticed a recurring problem: there are just too many moving pieces and not enough visibility to monitor them.<p>Detecting that there is a problem becomes easier with tools like APM (Datadog) + exception management (Sentry) + logs (Kibana). But in my experience, even when you know something is broken, it’s hard to find out why. This becomes even more pronounced when working with multiple teams (dev, DevOps, infra, analysts) who use various systems & tools.<p>While debugging a problem I find myself forced to open several tools (Kibana + Datadog + AWS + GitHub + Sentry + Slack) and use different techniques in order to pinpoint the real root cause.<p>I know GitHub should be the single source of truth, but in my experience this is not the case for most problems. For example: infra changes that were done via the AWS console, manual schema changes, a recent deploy/rollback, cron runs that we forgot about, an undocumented DB change, etc.<p>The RCA tends to lead me away from GitHub and into the dark corners of the system. To mitigate this pain, I’ve found several solutions that helped us. For example:<p>The status (start/end) of important cron runs is sent to Kibana<p>We asked people to post infra config changes in a dedicated Slack channel (but sometimes people just forget)<p>Some questions that I find interesting:<p>Do you also feel this pain? Or is it just me?<p>What are the best practices for tracking all of these changes?<p>Did you implement some in-house solutions in order to solve this?<p>How can we reduce the time it takes to find the root cause?<p>Is it just me, or has Slack become a super important tool in the process of tracking changes?<p>Would be happy to get any advice or feedback.
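
For illustration, the cron start/end reporting mentioned in the post could look roughly like the sketch below. It is only an assumption of how such a wrapper might be built, not the poster's actual setup: the names `make_event` and `run_with_events` and the event fields are hypothetical, and it presumes that a log shipper already forwards JSON lines from stdout to the logging stack behind Kibana.

```python
import json
import subprocess
import sys
import time
from datetime import datetime, timezone


def make_event(job, phase, **extra):
    """Build one structured event describing a cron run phase."""
    event = {
        "job": job,
        "phase": phase,  # "start" or "end"
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    event.update(extra)
    return event


def emit_stdout(line):
    """Write one JSON line to stdout and flush so a log shipper sees it promptly."""
    sys.stdout.write(line + "\n")
    sys.stdout.flush()


def run_with_events(job, command, emit=emit_stdout):
    """Run `command` (a list of arguments), emitting a start and an end event."""
    emit(json.dumps(make_event(job, "start")))
    started = time.monotonic()
    result = subprocess.run(command)
    duration = round(time.monotonic() - started, 3)
    emit(json.dumps(make_event(job, "end",
                               exit_code=result.returncode,
                               duration_s=duration)))
    return result.returncode
```

A crontab entry would then call a small script that invokes `run_with_events("nightly-report", [...])` instead of running the job directly, so a run that started but never ended shows up as a start event with no matching end event.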
======
verdverm
You are on point about observability; it is a keystone of the mitigation and
RCA process.

I don't think source code can be the source of truth. It lacks the
observability of live systems. How would you know what's going wrong without
logs and metrics?

Your infra ought to be code (terraform et al) but this still does not cover the
state of the live system.
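
To make the gap between declared infra and the live system concrete, here is a
minimal sketch of a drift check. It assumes both sides have already been
exported as flat key/value dicts; `find_drift`, `MISSING`, and the sample
settings are hypothetical and not tied to terraform or any cloud API.

```python
MISSING = "<missing>"  # placeholder for a key present on only one side


def find_drift(declared, live):
    """Return the settings whose declared and live values differ."""
    drift = {}
    for key in sorted(set(declared) | set(live)):
        declared_value = declared.get(key, MISSING)
        live_value = live.get(key, MISSING)
        if declared_value != live_value:
            drift[key] = {"declared": declared_value, "live": live_value}
    return drift


# Hypothetical example: someone resized an instance and added a tag by hand.
declared = {"instance_type": "t3.medium", "min_size": 2}
live = {"instance_type": "t3.large", "min_size": 2, "manual_tag": "hotfix"}

for key, values in find_drift(declared, live).items():
    print(f"{key}: declared={values['declared']} live={values['live']}")
```

Anything such a check reports is a change that never went through source
control, which is the kind of change the original post describes chasing
during an RCA.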

DevOps, and development generally, will require many systems to get an RCA.
The many moving parts and information sources are part of modern development
and systems. There are too many technologies and tools to unify.

Slack should not be anything more than a communication system, definitely not
a source of truth or tracking platform.

Sounds like your company needs to improve its practices.

#DevOpsLife

