
Things We Forgot to Monitor - jehiah
http://word.bitly.com/post/74839060954/ten-things-to-monitor?h=2
======
AznHisoka
Also: 1) Maximum # of open file descriptors

2) Whether your slave DB stopped replicating because of some error.

3) Whether something is screwed up in your SOLR/ElasticSearch instance so it
doesn't respond to search queries, but still responds to simple heartbeat pings.

4) If your Redis db stopped saving to disk because of a lack of disk space, not
enough memory, or because you forgot to set the overcommit_memory sysctl.

5) If you're running out of space in a specific partition where you usually store
random stuff, like /var/log.

I've had my ass bitten by all of the above :)
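
For item 1 above, a rough sketch of the kind of check I mean, reading the
standard procfs counters (the 90% threshold is just an example):

    # Sketch (Linux): compare system-wide fd usage against the limit, and count
    # a single process's open descriptors. The 90% threshold is arbitrary.
    import os

    def system_fd_usage():
        # /proc/sys/fs/file-nr holds three numbers: allocated, unused, max
        with open("/proc/sys/fs/file-nr") as f:
            allocated, _unused, maximum = map(int, f.read().split())
        return allocated, maximum

    def process_fd_count(pid):
        # Each open descriptor appears as an entry under /proc/<pid>/fd
        return len(os.listdir("/proc/%d/fd" % pid))

    if __name__ == "__main__":
        used, limit = system_fd_usage()
        if used > 0.9 * limit:
            print("WARNING: %d of %d system-wide fds in use" % (used, limit))
        print("this process has %d fds open" % process_fd_count(os.getpid()))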

~~~
Gracana
> Maximum # of open file descriptors

Augh. I ran one of my servers _hard_ into that wall, and now it's something I
watch. At least I learned from that mistake.

~~~
wtracy
I learned the hard way that MySQL creates a file descriptor for every database
partition you create. Someone had a script that created a new partition every
week...

~~~
pbhjpbhj
So after 5000 years you were running out?

~~~
wtracy
I forget the details, but practically speaking the database keeled over after
some 200 or 500 files were open at the same time.

------
otterley
Swap rate (as opposed to space consumed) is probably the #1 metric that
monitoring agents fail to report.

One thing that drives me nuts is how frequently monitoring agents/dashboards
report and graph only free memory on Linux, which gives misleading results.
It's fine to report it, but to make sense of it, you have to stack free memory
along with cached and buffered memory, if you care about what's actually
available for applications to use.
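
For the "actually available" number, a minimal sketch on Linux, stacking
MemFree + Buffers + Cached from /proc/meminfo (newer kernels also export
MemAvailable, which is a better estimate when present):

    # Sketch: approximate "memory actually available to applications" by
    # stacking MemFree + Buffers + Cached from /proc/meminfo; prefer
    # MemAvailable when the kernel exports it.
    def meminfo():
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                info[key] = int(value.split()[0])  # values are in kB
        return info

    m = meminfo()
    available_kb = m.get("MemAvailable", m["MemFree"] + m["Buffers"] + m["Cached"])
    print("free: %d kB, available-ish: %d kB of %d kB total"
          % (m["MemFree"], available_kb, m["MemTotal"]))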

Another often-overlooked metric that's important for web services in
particular is the TCP accept queue depth, per listening port. Once the accept
queue fills up, remote clients will get ECONNREFUSED, which is a bad place
to be. This value is somewhat difficult to obtain, though, because AFAIK Linux
doesn't expose it.
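
One indirect signal the kernel does export is the cumulative listen-queue
overflow/drop counters in /proc/net/netstat (the same numbers "netstat -s"
summarizes); a rough sketch of reading them:

    # Sketch: read the cumulative listen-queue overflow/drop counters from
    # /proc/net/netstat (TcpExt: ListenOverflows, ListenDrops). These are
    # totals since boot, so a monitoring agent would watch the rate of change.
    def tcp_ext_counters():
        with open("/proc/net/netstat") as f:
            lines = f.readlines()
        counters = {}
        # The file comes in header/value line pairs per protocol section.
        for i in range(0, len(lines) - 1, 2):
            header = lines[i].split()
            values = lines[i + 1].split()
            if header[0] == "TcpExt:":
                counters = dict(zip(header[1:], map(int, values[1:])))
        return counters

    c = tcp_ext_counters()
    print("listen queue overflows:", c.get("ListenOverflows", 0),
          "drops:", c.get("ListenDrops", 0))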

~~~
justincormack
Yes I can't find the socket backlog anywhere in Linux. FreeBSD exposes it via
kqueue
[http://www.freebsd.org/cgi/man.cgi?query=kqueue](http://www.freebsd.org/cgi/man.cgi?query=kqueue)
through the data item in EVFILT_READ.
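
A rough sketch of reading that from Python on FreeBSD; per the man page, for a
listening socket registered with EVFILT_READ the kevent's data field is the
number of pending connections (the port is just an example):

    # Sketch (FreeBSD): for a listening socket under EVFILT_READ, the returned
    # kevent's 'data' field is the current accept-queue depth. Port 8080 is
    # just an example.
    import select
    import socket

    sock = socket.socket()
    sock.bind(("0.0.0.0", 8080))
    sock.listen(128)

    kq = select.kqueue()
    kq.control([select.kevent(sock.fileno(),
                              select.KQ_FILTER_READ,
                              select.KQ_EV_ADD)], 0)   # register, fetch nothing

    events = kq.control(None, 1, 5.0)   # wait up to 5s for a pending connection
    if events:
        print("connections waiting in the accept queue:", events[0].data)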

~~~
otterley
With FreeBSD it's even easier; you can use "netstat -L".

------
bradleyland
Interestingly, an out-of-the-box Munin configuration on Debian contains nearly
all of these. I recommend setting up Munin and having a look at what it
monitors by default, even if you don't intend to use it as your monitoring
solution.

~~~
hansjorg
Installation on Debian/Ubuntu is also as simple as installing the munin
package (munin-node for subsequent hosts) and pointing a webserver at the
right directory.

Extremely valuable when something is acting up.

------
tantalor
Some people, when confronted with a problem, think “I know, I'll send an email
whenever it happens.” Now they have two problems.

~~~
marcosdumay
I really don't get where you are going with that.

Are you arguing that alerts are useless, and we must fix the issue once and for
all? Because if so, I'd point out that some things cannot be fixed (because the
Earth is finite, we don't know all things, etc.) and you are better off being
alerted sooner rather than later.

Now, if you are arguing that email is not the right medium for an alert, well,
what medium is better? Really, I can't think of any single candidate. Yes,
email may go down; that's why you complement it with some system external to
your network (a VPS is cheap, a couple of them at different providers is
almost flawless, and way cheaper than any proprietary dashboard). Yes, there is
some delay involved, but it should be a few minutes at most, because you
create some addresses specifically for the alerts and make all hell break
loose when a message gets there. A standard IM protocol that federated across
your whole network (and an external point of control), could be reached from
anywhere, and had plenty of support on all kinds of computers would be better,
but it does not exist.

~~~
aryastark
I think you're being obtuse.

Once you start sending emails for things, you start sending emails for
_everything_. It's easy to fall into the trap of not accurately categorizing
what is critical (like real, real, critical, I mean it this time guys!) and
what are merely _statuses_. So what happens is everything starts being
ignored, and your systems become obscure black boxes again.

~~~
hueving
I think you were the one being obtuse. There is no assumption that you will
start receiving useless email status updates. In fact, most reasonable
monitoring tools only email when a status changes to a problem state.

~~~
dredmorbius
_most reasonable monitoring tools_

20+ years of experience tells me most monitoring tools aren't reasonable.

~~~
hueving
Then don't use them? My point is that there is nothing wrong with email
alerts, so the statement about them being a problem sounds like a
misconfiguration or a failure to understand how to set up email filters.

~~~
dredmorbius
_there is nothing wrong with email alerts_

You're wrong.

As a sysadmin, I typically receive something on the order of 1,000 to 10,000
emails daily (the specifics vary by the system(s) I'm admining). Staying on
top of my email stream is a significant part of my job, both in _not_ ignoring
critical messages which have been lost, misfiled, or spam-filtered, and in not
getting bogged down in verbose messages which convey no real information.

Alerts which tell me nothing have a negative value: they obscure real
information, they don't convey useful information, and each person who comes
on to the team has to learn that "oh, those emails you ignore", write rules to
filter or dump them, etc.

Worse: if the alerts _might_ contain useful information, _that_ fact has to be
teased out of them.

The problem with emails like these is that they're really logging or reporting
data. They should be logged, not emailed, and with appropriate severity (info,
warning, error, critical). Log analysis tools can be used to search for and
report on issues from there.
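
A tiny sketch of that approach, routing messages to the local syslog daemon
with a severity instead of mailing them (the logger name, socket path and
facility are just common examples):

    # Sketch: send status/reporting messages to syslog with a severity level
    # instead of emailing them; a log-analysis tool can then filter on severity.
    import logging
    import logging.handlers

    logger = logging.getLogger("backup-job")          # example logger name
    handler = logging.handlers.SysLogHandler(
        address="/dev/log",                           # typical local syslog socket
        facility=logging.handlers.SysLogHandler.LOG_DAEMON)
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info("nightly backup finished in 42s")     # reporting data: just log it
    logger.error("nightly backup failed: disk full")  # this one may page someone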

As I said: in a mature environment, much of my work goes into _removing_
alerts, alert emails, etc., which are well-intentioned but ultimately useless.

~~~
hueving
>As a sysadmin, I typically receive something on the order of 1,000 to 10,000
emails daily

Sorry, but you're not a very good sysadmin then. You have chosen poor tools or
do not understand how to distill the information. Knowing that, I can see why
you think email alerts don't work. They are effectively broken FOR YOU.

~~~
ersii
And you don't think vendors have a responsibility to reflect upon the way they
do alerts and/or service monitoring?

It's usually not the system administrators that get to decide what the
Corporate Overlords purchase or who they do business with. So I think it's
pretty unfair to blame the admins for "choosing poor tools".

------
dredmorbius
The corollary of this post is "things we've been monitoring and/or alerting on
which we shouldn't have been".

Starting at a new shop, one of the first things I'll do is:

1\. Set up a high-level "is the app / service / system responding sanely"
check which lets me know, from the top of the stack, whether or not everything
else is functioning properly.

2\. Go through the various alerting and alarming systems and generally dial
the alerts _way_ back. If it's broken at the top, or if some vital resource is
headed to the red, let me know. But if you're going to alert based on a
cascade of prior failures (and DoS my phone, email, pager, whatever), then
STFU.

In Nagios, setting up relationships between services and systems for alerting
purposes, and setting thresholds appropriately, is key.

For a lot of thresholds you're going to want to find out _why_ they were set
to what they were and what historical reason there was for that. It's like the
old pot roast recipe where Mom cut off the ends of the roast 'coz that's how
Grandma did it. Not realizing it was because Grandma's oven was too small for
a full-sized roast....

Sadly, that level of technical annotation is often lacking in shops,
especially where there's been significant staff turnover through the years.

I'm also a fan of some simple system tools such as sysstat which log data that
can then be graphed for visualization.

------
jlgaddis
Be sure to monitor your monitoring system as well (preferably from outside
your network/datacenters)! If you don't have anything else in place, you can
use Pingdom to monitor one website/server for free [0].

I was off work for a few months recently (motorcycle wreck) and removed my
e-mail accounts from my phone. Now, I have all my alerts go to a specific
e-mail address and those are the only mails I receive on my phone. It has
really helped me overcome the problem of ignoring messages.

[0]: [https://www.pingdom.com/free/](https://www.pingdom.com/free/)

------
comice
We monitor outgoing smtp and http connections from anything that requires
those services.

And the best general advice I have is to split your alerts into "stuff that I
need to know is broken" and "stuff that just helps me diagnose other
problems". You don't want to be disturbing your on-call people for stuff that
doesn't directly affect your service (or isn't even something you can fix).

------
mnw21cam
Also: are your backups working?

------
jsmeaton
We had a perfect storm of problems only 2 weeks ago.

1\. A vendor tomcat application had a memory leak, consumed all the RAM on a
box, and crashed with an OOM

2\. The warm standby application was slightly misconfigured, and was unable to
take over when the primary app crashed

3\. Our Nagios was configured to email us, but something had gone wrong with
ssmtp 2 days prior, and it was unable to contact Google Apps.

3a. No one was paying any attention to our server metric graphs / we didn't
have good enough "pay attention to these specific graphs because they are
currently outside the norm" alerting.

A very embarrassing day for us that one.

We're now working on better graphing, and have set up a basic ssmtp check to
SMS us if there is an issue. Monitoring is hard.
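
For what it's worth, a minimal sketch of such a mail-path check (the relay
host is a placeholder); a non-zero exit is what triggers the SMS fallback:

    # Sketch of a basic "can we still send alert mail?" check. The relay host
    # is a placeholder; on failure, escalate over a different channel.
    import smtplib
    import sys

    RELAY = "smtp-relay.example.com"   # placeholder

    def mail_path_ok():
        try:
            with smtplib.SMTP(RELAY, 25, timeout=10) as smtp:
                smtp.noop()            # cheap round-trip; no message is sent
            return True
        except Exception as exc:
            print("alert mail path broken: %s" % exc, file=sys.stderr)
            return False

    if not mail_path_ok():
        sys.exit(1)   # non-zero exit -> cron/monitoring triggers the SMS fallback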

~~~
marcosdumay
> and have set up a basic ssmtp check to SMS us if there is an issue.

And what will happen when the network (or the alert server) is down?

You must put some check outside your network, with independent infrastructure.
Adding another protocol on the same net is still subject to Murphy's law.

~~~
berkay
Independent infrastructure is a good idea but not always feasible for
everyone. At OpsGenie, to resolve this problem, we came up with a solution we
refer to as "heartbeat monitoring". This basically allows monitoring tools to
send periodic heartbeat messages to us, indicating that the tool is up and
can reach us. If we don't receive a heartbeat message from a tool for 10
minutes, we generate an alert and notify the admins. It's not out-of-band
management, but it does the trick to prevent situations like the one jsmeaton
described.

[http://support.opsgenie.com/customer/portal/articles/759603-...](http://support.opsgenie.com/customer/portal/articles/759603-heartbeat-monitoring)
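
The sender side of that pattern is tiny; a sketch using only the standard
library (the URL below is a placeholder, not our actual API endpoint):

    # Sketch of the sender side of a heartbeat ("dead man's switch") check: a
    # cron job pings an external service every few minutes, and that service
    # alerts if the pings stop. The URL below is a placeholder.
    import urllib.request

    HEARTBEAT_URL = "https://alerts.example.com/heartbeat/my-nagios-host"

    try:
        with urllib.request.urlopen(HEARTBEAT_URL, timeout=10) as resp:
            resp.read()
    except Exception:
        # Deliberately swallow errors: if this host or its network is broken,
        # the missing heartbeat itself raises the alert on the far side.
        pass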

------
sp332
You're using icanhazip.com in production? I see from a quick Google search
that Puppy Linux seems to use it in some scripts, but how reliable is it?

~~~
jphines
    $ curl -i -k -L icanhazip.com
    HTTP/1.1 200 OK
    Date: Mon, 10 Feb 2014 20:13:28 GMT
    Server: Apache
    Content-Length: 15
    Content-Type: text/plain; charset=UTF-8
    X-RTFM: Learn about this site at http://bit.ly/14DAh2o and don't abuse the service
    X-YOU-SHOULD-APPLY-FOR-A-JOB: If you're reading this, apply here: http://rackertalent.com/
    X-ICANHAZNODE: icanhazip2.nugget

Would seem only fair. :D

------
baruch
About reboot monitoring: I suggest using kdump to dump the oops information
and save it for later debugging and understanding of the issue. It may even be
an uncorrectable memory or PCIe error you are seeing, and the info is logged in
the oops but is hard to figure out otherwise. Also, if you consistently hit a
single kernel bug, you may want to fix it or work around it.

------
lincolnpark
Also: are your API endpoints working properly?

~~~
dredmorbius
Can you expand on that?

~~~
AznHisoka
Ha I can.

Sometimes API providers change the damn response format. Or their URLs change.
Or they block your IP without notifying you.
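
A sketch of a contract-style check that catches both cases, assuming a JSON
endpoint (the URL and field names are made up for illustration):

    # Sketch: periodically hit an upstream API and verify not just "200 OK" but
    # that the response still has the shape we depend on. URL and field names
    # are hypothetical examples.
    import json
    import urllib.request

    API_URL = "https://api.example.com/v1/items?limit=1"   # placeholder
    EXPECTED_FIELDS = {"id", "name", "updated_at"}          # fields our code relies on

    def api_contract_ok():
        try:
            with urllib.request.urlopen(API_URL, timeout=10) as resp:
                payload = json.load(resp)
        except Exception as exc:
            print("API unreachable or blocked us:", exc)
            return False
        items = payload.get("items", [])
        if not items or not EXPECTED_FIELDS.issubset(items[0]):
            print("API response format changed:",
                  sorted(items[0]) if items else "empty")
            return False
        return True

    print("ok" if api_contract_ok() else "FAIL")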

~~~
dredmorbius
Thanks.

I was thinking of some sort of end-point test myself, but hadn't considered the
specific case of APIs.

------
jlgaddis
I have gear in three different facilities and I'm typically not visiting any of
them unless I'm installing or replacing hardware. Shortly after starting at
$job, I realized there was no monitoring of the RAID arrays in the servers we
have. That could have ended badly.

------
herokusaki
How oversold your VPS provider's servers are: commonly blamed for slowdowns,
but rarely measured.

------
stephengillie
Between PRTG and Windows, almost all of that is handled for us. And PRTG can
query OMSA via SNMP.

