
Ask HN: What observability efforts have had the best impact on your company? - jturolla
I'm currently leading an observability team at a company with 150+ engineers and 150+ services. We spent the last 6 months building infrastructure around Prometheus, including instrumentation, alerting rules, high availability, and scalability/sharding. We have a multi-Kubernetes-cluster setup with multiple customer shards and a high growth rate.

When we started, our observability stack was composed of Riemann, Splunk, and thousands of unloved CloudWatch alerts. We found Riemann very difficult to work with for our setup and our multiple shards, so we evaluated our options and finally chose Prometheus.

We spend a lot of time pairing with squads so they can learn Prometheus and create instrumentation and alerts on their own.

We're in the middle of the Prometheus rollout. Roughly 50% of the services and squads are using metrics and alerts, and this is on track to reach 90% by the end of next quarter.

Having a metrics and alerting platform ready to use is making engineers much more confident in their services, but we're still seeing some incidents that are reported by customer support rather than by engineering tools. That could mean either that we're missing something or that we just have to double down on the rollout to make sure everything is covered and alerting correctly.

Could you share your thoughts and experience around observability?
======
wjossey
One thing I'd encourage is to build a culture of looking at graphs daily as a
matter of practice. Alerts are only as effective as the knowledge of the
writer, and human-based detection is almost always the first step to better
alerting.

By looking at graphs on a daily basis, you get much faster at diagnosing
issues and at understanding the root cause of the underlying problem (assuming
you're measuring the XYZ that broke). So many times I was able to look at a
graph and go, "Something isn't right", long before an alert would go off. I
wasn't magical; I had just spent a little bit of time every day reviewing
graphs, so I could very rapidly pick out subtle changes across a handful of
graphs, changes that would also have been hard to preemptively build checks
around.

One additional thought is to not over-alert. Alert fatigue is your greatest
enemy, and needs to be avoided at all costs. If you're not reacting to an
alert, you should question whether or not you need it.
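To make "are we reacting to this alert?" something you can review regularly, here is a minimal Python sketch. It assumes you can export a history of alert firings together with a flag saying whether anyone acted on each one. The `Firing` record, the `review_candidates` helper, and both thresholds are hypothetical names and values chosen for illustration, not part of any tool's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Firing:
    """One alert firing (hypothetical record shape)."""
    alert: str       # alert name
    acted_on: bool   # did a human do anything in response?


def review_candidates(firings, min_firings=5, max_action_rate=0.2):
    """Return alert names that fire often but are rarely acted on.

    min_firings and max_action_rate are assumed thresholds; tune them
    to your own on-call history.
    """
    fired = defaultdict(int)
    acted = defaultdict(int)
    for f in firings:
        fired[f.alert] += 1
        acted[f.alert] += int(f.acted_on)
    return sorted(
        name
        for name, count in fired.items()
        if count >= min_firings and acted[name] / count <= max_action_rate
    )


# Made-up history: one noisy alert, one useful alert, one rare alert.
history = (
    [Firing("DiskAlmostFull", False)] * 10
    + [Firing("HighErrorRate", True)] * 4
    + [Firing("HighErrorRate", False)] * 2
    + [Firing("RareAlert", False)] * 2
)
candidates = review_candidates(history)
```

Anything this returns is a candidate for a conversation, not an automatic deletion: an alert that is never acted on may still cover a rare but serious failure.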

~~~
jturolla
> Alerts are only as effective as the knowledge of the writer

We're enforcing that every alert links to a playbook with more details about
the issue and how to mitigate it.
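A rule like this can be checked mechanically, for example in CI. The sketch below is a minimal Python illustration, not our actual tooling: it assumes the rule file has already been parsed into a dict with the usual Prometheus `groups` / `rules` / `annotations` layout, and the annotation key `playbook_url`, the helper name, and the example URL are all hypothetical.

```python
from urllib.parse import urlparse

# Assumed annotation key; pick whatever name your team standardizes on.
PLAYBOOK_KEY = "playbook_url"


def alerts_missing_playbook(rule_file):
    """Return names of alerting rules that lack a usable playbook link."""
    missing = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            name = rule.get("alert")
            if name is None:
                continue  # recording rule, not an alert
            url = (rule.get("annotations") or {}).get(PLAYBOOK_KEY) or ""
            parsed = urlparse(url)
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                missing.append(name)
    return missing


# Made-up rule file, already loaded into plain Python data.
example_rules = {
    "groups": [
        {
            "name": "checkout",
            "rules": [
                {
                    "alert": "HighErrorRate",
                    "expr": "job:request_errors:rate5m > 0.05",
                    "annotations": {
                        "playbook_url": "https://example.com/playbooks/high-error-rate"
                    },
                },
                {
                    "alert": "QueueBacklog",
                    "expr": "queue_depth > 1000",
                    "annotations": {"summary": "Queue is backing up"},
                },
                {"record": "job:request_errors:rate5m", "expr": "rate(errors[5m])"},
            ],
        }
    ]
}
missing = alerts_missing_playbook(example_rules)
```

This only verifies that a link is present and well formed; whether the playbook behind it is any good still takes a human review.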

As for over-alerting, we were already over-alerting before we began the
taskforce, and one of our efforts is removing the CloudWatch and Riemann
alerts that are usually noisy and do not offer much insight into the real
issues.

