
Simple alerts for missing metrics data with a focus on Wavefront - ahakanbaba
https://blog.box.com/blog/handling-missing-metrics-wavefront/
======
killertypo
I've been in environments without any real sense of monitoring: just Pingdom
and ESX health checks that provide only a tiny frame of reference for the
actual health of an application.

Being able to know the true health of a service is an absolute godsend.

So many times a service had been dead or gone for hours before we were made
aware of a real issue (well, our customers noticed, but the report had to
funnel up the pipeline from customer, to support, to engineering).

Nothing says good PR like being dead in the water for half a day with no idea
it happened.

~~~
ahakanbaba
I absolutely agree. For a service with any availability guarantees there has
to be rigorous monitoring and alerting.

This also holds for services with internal clients. In other words, if your
output is consumed only by other services in the same company, the same high
monitoring standards must apply. Otherwise failure detection becomes very
delayed and the productivity of many teams suffers. There is no worse
buzzkill than explaining to other service owners what is wrong with their
application.

One other important lesson we have learned is that alerts require time to
mature. Thresholds need to be tuned and the alert formulation revised. Our
alerts usually produce a couple of false positives in their first two weeks,
during which we frequently refine their conditions.
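
To make that concrete, here is a minimal sketch (my own, not the article's
Wavefront implementation) of the kind of missing-data check the article is
about: flag a series that has gone silent for longer than a staleness window.
That window is exactly the threshold that needs tuning in those first two
weeks. The metric name and window length are made up for illustration.

    import time

    STALENESS_WINDOW_S = 600  # assumed 10-minute window; tune per metric

    def is_missing(last_point_ts: float, now: float | None = None) -> bool:
        """True if the series has reported nothing within the window."""
        now = now if now is not None else time.time()
        return (now - last_point_ts) > STALENESS_WINDOW_S

    # Hypothetical usage: last_seen would come from your metrics store.
    last_seen = {"checkout.requests": time.time() - 900}
    for metric, ts in last_seen.items():
        if is_missing(ts):
            print(f"ALERT: no data for {metric} in {STALENESS_WINDOW_S}s")

A too-small window fires on every ingestion hiccup (the false positives
mentioned above); a too-large one delays detection, so in practice you start
conservative and tighten it as you learn the metric's reporting cadence.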

