

Monitoring Theory - mmt
http://teddziuba.com/2011/03/monitoring-theory.html

======
mmt
Ted neglects the last of the four categories: actionable but not informative.

To this I assign many of the resource utilization alerts, such as disk or
memory full. They're certainly actionable, but, in very many cases, they don't
tell you if a critical service is down because of it. Arguably, these are
another kind of "cool story, bro" alert.

Regardless, they're a common frustration because they're so easy to monitor
that they're even an example/default in Nagios, along with the oh-so-
uninformative load average[1]. They're one specific case of giving in to the
general temptation of monitoring something just because it's easy.

[1] I routinely remove this from every monitoring system I touch, even if it
means modifying the source and compiling a custom build, since, invariably,
someone will point to it without even knowing what it means to be on the run
queue.
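For readers who, like the footnote says, aren't sure what the number means: on Linux, the load average is an exponentially damped moving average of the number of tasks that are runnable (on the run queue) or in uninterruptible sleep, sampled roughly every 5 seconds, with 1-, 5- and 15-minute windows. A minimal sketch of that update rule, assuming Linux semantics and using floating point instead of the kernel's fixed-point arithmetic:

```python
import math

SAMPLE_INTERVAL = 5.0  # seconds between samples (Linux samples about every 5 s)

def decay_factor(window_minutes: float) -> float:
    """Per-sample decay weight for a 1-, 5- or 15-minute load average."""
    return math.exp(-SAMPLE_INTERVAL / (60.0 * window_minutes))

def update_load(prev: float, active_tasks: int, window_minutes: float) -> float:
    """Fold one sample (runnable + uninterruptible tasks) into the damped average."""
    e = decay_factor(window_minutes)
    return prev * e + active_tasks * (1.0 - e)

def simulate(samples, window_minutes: float = 1.0) -> float:
    """Run a sequence of task-count samples through the average, starting at 0."""
    load = 0.0
    for n in samples:
        load = update_load(load, n, window_minutes)
    return load
```

Feeding in a constant 4 active tasks for 60 samples (five minutes) brings the 1-minute average to about 3.97. Note that the result says nothing about how many CPUs the box has, or whether those tasks are waiting on CPU or on disk I/O, which is much of why it is so uninformative as an alert.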

~~~
wladimir
Memory and disk full alerts don't tell you if a critical service is down, but
when your disk or memory fills up you can be sure critical services _will_ fail.
They also tell you of resource leaks. IMO, that's as informative and
actionable as it gets.
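One way to make a usage alert say something about leaks, as suggested here, is to alert on the trend rather than the level. A minimal sketch, assuming usage grows roughly linearly and that samples are `(unix_time, bytes_used)` pairs collected by whatever monitoring system is in place; `seconds_until_full` is a hypothetical helper, not a Nagios feature:

```python
from typing import Optional, Sequence, Tuple

def seconds_until_full(samples: Sequence[Tuple[float, float]],
                       capacity: float) -> Optional[float]:
    """Fit a least-squares line to usage over time and extrapolate to capacity.

    Returns None when there is too little data or usage is flat or shrinking,
    i.e. when there is no evidence of a leak.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    cov = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    slope = cov / var_t  # growth in bytes per second
    if slope <= 0:
        return None
    last_t = samples[-1][0]
    # Usage predicted by the fitted line at the most recent sample time
    predicted_now = mean_u + slope * (last_t - mean_t)
    return max(0.0, (capacity - predicted_now) / slope)
```

For example, usage of 100, 200 and 300 bytes at t = 0, 10 and 20 s against a 1000-byte capacity gives 70 seconds of headroom. An alert on "less than N hours until full" is both actionable and informative in the sense used above, since it fires on the leak rather than on the moment of failure.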

~~~
mmt
I disagree. A disk can be full with no critical service ever failing,
especially if it's something like a root disk but the app is on another
device.
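The root-disk case suggests checking only the filesystem that actually holds the application's data rather than every mounted device. A minimal sketch using Python's `shutil.disk_usage`, which reports on the filesystem containing the given path; the threshold and the Nagios-style `OK`/`CRITICAL` labels are assumptions for illustration, and `disk_status`/`check_app_disk` are hypothetical names:

```python
import shutil

def disk_status(free_bytes: int, total_bytes: int,
                min_free_fraction: float = 0.05) -> str:
    """Classify free space; only a crossed threshold is CRITICAL."""
    if total_bytes <= 0:
        return "UNKNOWN"
    return "CRITICAL" if free_bytes / total_bytes < min_free_fraction else "OK"

def check_app_disk(data_path: str, min_free_fraction: float = 0.05) -> str:
    """Check the filesystem that holds data_path, not every mounted device."""
    usage = shutil.disk_usage(data_path)  # named tuple: total, used, free (bytes)
    return disk_status(usage.free, usage.total, min_free_fraction)
```

Pointing `check_app_disk` at the application's data directory (a hypothetical `/var/lib/myapp`, say) means a full root disk on another device never pages anyone, which is the distinction being drawn here.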

For memory, depending on the definition of "full," it, too, can be just fine.
Even having something like individual app servers die and get restarted by
Apache isn't such a bad thing if the cause is a slow enough memory leak.

------
mmt
>then you can sure as hell "spin up more synchronous workers".

And even with a free load balancer like HAProxy, you can have a set of
fallback machines usable by any backend that's over capacity, with the spin-up
happening on demand, much like Apache spawning workers.
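The shared-fallback idea can be modeled as a routing rule: send each request to its backend's own pool, and spill over to a common fallback pool only when that pool is full. A minimal sketch, assuming each pool has a fixed number of worker slots; `Pool`, `SpilloverRouter` and the pool names are hypothetical, and this models the decision only, not HAProxy configuration:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Pool:
    name: str
    capacity: int  # worker slots in this pool
    in_use: int = 0

    def try_acquire(self) -> bool:
        """Take a slot if one is free."""
        if self.in_use < self.capacity:
            self.in_use += 1
            return True
        return False

    def release(self) -> None:
        self.in_use = max(0, self.in_use - 1)

class SpilloverRouter:
    def __init__(self, backends: Dict[str, Pool], fallback: Pool) -> None:
        self.backends = backends
        self.fallback = fallback  # shared by every backend

    def route(self, backend: str) -> Optional[Pool]:
        """Prefer the backend's own pool; spill to the shared fallback when full."""
        for pool in (self.backends[backend], self.fallback):
            if pool.try_acquire():
                return pool
        return None  # everything is saturated: queue or reject the request
```

With `app` holding 2 slots, `api` holding 1 and a 1-slot fallback, three `app` requests land on `app`, `app`, `fallback`, and a second concurrent `api` request finds nothing free. "On-demand" spin-up would correspond to growing the fallback pool's capacity when it fills, the way Apache adds workers under load.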

