Things We Forgot to Monitor (word.bitly.com)
232 points by jehiah on Feb 10, 2014 | 61 comments



Also: 1) Maximum # of open file descriptors

2) Whether your slave DB stopped replicating because of some error.

3) Whether something is screwed up in your SOLR/ElasticSearch instance so that it doesn't respond to search queries but still responds to simple heartbeat pings.

4) Whether your Redis DB stopped saving to disk because it ran out of disk space or memory, or because you forgot to set the overcommit_memory sysctl.

5) Whether you're running out of space on a specific partition where you store miscellaneous stuff, like /var/log.

I've had my ass bitten by all of the above :)
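
For (2) and (5) in particular, a couple of crude checks along these lines (paths and credentials are placeholders) can be dropped into cron or whatever agent you already run:

  # Is the MySQL slave still replicating? (assumes credentials in ~/.my.cnf)
  mysql -e 'SHOW SLAVE STATUS\G' | egrep 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'

  # Is the partition holding /var/log filling up?
  df -hP /var/log | awk 'NR==2 {print $5 " used on " $6}'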


6) Free inodes (as distinct from space) per filesystem.
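
A quick way to keep an eye on it, which any agent can wrap (the 90% threshold is arbitrary):

  # Inode usage per filesystem; flag anything above 90%
  df -iP | awk 'NR>1 && $5+0 > 90 {print $6 " is at " $5 " of its inodes"}'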


Similar to free inodes, you should also check the maximum number of directories. The dir_index option helps, but I've seen it become a problem.


There's a maximum number of directories? On what filesystem is that?


ext3 without dir_index has a limit of 32K directories in any one directory.

Where I saw it crop up was 32K folders under /tmp on a cluster system. So no, it's not a limit on the total number of directories (that's inodes), but rather on how many subdirectories a single directory can have.

http://en.wikipedia.org/wiki/Ext4#Features <-- Fixes 32K limit


ext3/4 has really poor large-directory performance, even with dir_index, especially if you are constantly removing and readding nodes. I would highly recommend XFS for large-directory use cases.


I got bitten by this once. I think it was related to a maximum of 32K hard links per inode, which effectively sets a limit of 32K subdirectories, since each subdirectory's ".." entry is a hard link back to its parent.
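
Since a directory's link count is its number of subdirectories plus 2, you can spot directories creeping toward that limit with something like:

  # Link count of a directory = number of subdirectories + 2
  stat -c %h /tmp

  # Find directories approaching the 32K ceiling (here: more than 30000 links)
  find / -xdev -type d -links +30000 2>/dev/null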


> Maximum # of open file descriptors

Augh. I ran one of my servers hard into that wall, and now it's something I watch. At least I learned from that mistake.
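
A cheap way to watch it, both system-wide and per process (the PID here is a placeholder for whatever your server runs as):

  # System-wide: allocated file handles, free handles, and the fs.file-max ceiling
  cat /proc/sys/fs/file-nr

  # Per process: open fds vs. that process's own limit
  ls /proc/"$PID"/fd | wc -l
  grep 'Max open files' /proc/"$PID"/limits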


Related to this: if you've ever built or run anything on Solaris, you probably found out the hard way that even in modern times, fdopen() in 32-bit apps only allows up to 255 fds, because they so badly want to preserve an ages-old ABI. It's a fun bug to hit at runtime in production when you aren't aware of this compatibility "feature".


I learned the hard way that MySQL creates a file descriptor for every database partition you create. Someone had a script that created a new partition every week...


So after 5000 years you were running out?


I forget the details, but practically speaking the database keeled over after some 200 or 500 files were open at the same time.


X) Number of cgroups. We were getting slow performance, apparently related to slow IO, but nothing stood out as being the culprit. Turns out, since vsftpd was creating cgroups and not removing them, the pseudo-filesystem /sys/fs/cgroup had myriads of subdirectories (each representing a cgroup), and whenever something wanted to create a new cgroup or access the list of cgroups, this counted as listing that pseudo-directory, which counted as IO.

Fixed by using the undocumented option isolate_network=NO in vsftpd.conf.
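
If you suspect the same thing, counting the cgroup directories is a one-liner:

  # Every directory under the cgroup pseudo-filesystem is one cgroup
  find /sys/fs/cgroup -type d | wc -l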


Feels like this list (and the original post) covers problems caused by:

* lack of proper/default monitoring advocated for your tools (2), (4).

* Choosing poor (default/recommended) settings (1), (4).

* Keeping stateful servers/instances when you don't need to (5), (6).

* Not tracking performance as part of monitoring (3), (4)

That said, I have made the same mistakes too.

edit: formatting


Swap rate (as opposed to space consumed) is probably the #1 metric that monitoring agents fail to report.

One thing that drives me nuts is how frequently monitoring agents/dashboards report and graph only free memory on Linux, which gives misleading results. It's fine to report it, but to make sense of it, you have to stack free memory along with cached and buffered memory, if you care about what's actually available for applications to use.

Another often-overlooked metric that's important for web services in particular is the TCP accept queue depth, per listening port. Once the accept queue fills up, remote clients will get ECONNREFUSED, which is a bad place to be. This value is somewhat difficult to obtain, though, because AFAIK Linux doesn't expose it.
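
For the memory half of this, a minimal sketch of "memory actually available to applications" from /proc/meminfo (newer kernels also export MemAvailable directly, which is the better number when present):

  # Free + buffers + page cache, in kB -- an approximation; not all cache is freeable
  awk '/^(MemFree|Buffers|Cached):/ {sum += $2} END {print sum " kB"}' /proc/meminfo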


> One thing that drives me nuts is how frequently monitoring agents/dashboards report and graph only free memory on Linux, which gives misleading results. It's fine to report it, but to make sense of it, you have to stack free memory along with cached and buffered memory, if you care about what's actually available for applications to use.

Even that is misleading. It's actually non-trivial to find out exactly how much "freeable" memory one has on a Linux system these days, as not all of the cached memory is truly freeable.


Even then there are some wrinkles: the anonymous shared memory used by e.g. the Oracle SGA will show up as cached memory, but evicting it is a no-no.


Yes, I can't find the socket backlog anywhere in Linux. FreeBSD exposes it via kqueue (http://www.freebsd.org/cgi/man.cgi?query=kqueue), through the data item in EVFILT_READ.


With FreeBSD it's even easier; you can use "netstat -L".


Swap rate still looks like the wrong metric. It'd be better to have the rate of swap lookups, excluding all writes.


Swap-in rate, to be more specific. Swap-outs aren't incredibly worrisome.


That's backwards: things like mmap() will generate page-in activity during normal operation. Page-outs mean that the operating system had to evict something to satisfy other memory requests, which is what you really want to know.


swapouts and pageouts aren't identical in Linux, and are instrumented separately (pswpout and pgpgout, respectively; see /proc/vmstat). mmap() and other page-ins won't be counted under the swap statistics.

A pageout might suggest memory pressure, but not nearly as much as a swapout does. (pgmajfault is a better indicator.) Writing dirty pages is just something the kernel does even when there's no memory pressure at all. Also, unfortunately you can't use pgpgout for anything useful as ordinary file writes are counted there.
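
A rough way to get the swap-in rate being discussed here, by sampling /proc/vmstat (a sketch; a real agent would keep the previous sample instead of sleeping):

  # Swap-ins per second over a 10-second window
  a=$(awk '/^pswpin /{print $2}' /proc/vmstat); sleep 10
  b=$(awk '/^pswpin /{print $2}' /proc/vmstat)
  echo "$(( (b - a) / 10 )) swap-ins/sec"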


Interestingly, an out-of-the-box Munin configuration on Debian contains nearly all of these. I recommend setting up Munin and having a look at what it monitors by default, even if you don't intend to use it as your monitoring solution.


Installation on Debian/Ubuntu is also as simple as installing the munin package (munin-node for subsequent hosts) and pointing a webserver at the right directory.

Extremely valuable when something is acting up.
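
For reference, the Debian/Ubuntu version of that is roughly the following; you still have to list each node in /etc/munin/munin.conf on the master and allow the master's IP in each node's munin-node.conf:

  # On the box that collects and graphs the data
  apt-get install munin
  # On every box you want monitored
  apt-get install munin-node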


Some people, when confronted with a problem, think “I know, I'll send an email whenever it happens.” Now they have two problems.


I really don't get where you are going with that.

Are you arguing that alerts are useless and we should just fix the underlying issues instead? Because if so, I'd point out that some things cannot be fixed (the Earth is finite, we don't know everything, etc.) and you are better off being alerted sooner rather than later.

Now, if you are arguing that email is not the right medium for an alert, well, what medium is better? Really, I can't think of a single candidate. Yes, email may go down; that's why you complement it with some system external to your network (a VPS is cheap, a couple of them at different providers is almost flawless, and way cheaper than any proprietary dashboard). Yes, there is some delay involved, but it should be a few minutes at most, because you create some addresses specifically for the alerts and make all hell break loose when a message gets there. Some standard IM protocol that federated across your whole network (and an external point of control), could be reached from anywhere, and had plenty of support on all kinds of computers would be better, but it does not exist.


I got the GP's point immediately: He means that system administrators already get an enormous volume of email. Send them another email and it'll get ignored, deleted, or put at the bottom of a gigantic to-do list.

For airline pilots, an excessive number of warnings (bells, alarms, audible alerts) is known to distract the pilots and cause errors.


I think you're being obtuse.

Once you start sending emails for things, you start sending emails for everything. It's easy to fall into the trap of not accurately categorizing what is critical (like real, real, critical, I mean it this time guys!) and what are merely statuses. So what happens is everything starts being ignored, and your systems become obscure black boxes again.


I think you were the one being obtuse. There is no assumption that you will start receiving useless email status updates. In fact, most reasonable monitoring tools only email when a status changes to a problem state.


> most reasonable monitoring tools

20+ years of experience tells me most monitoring tools aren't reasonable.


Then don't use them? My point is that there is nothing wrong with email alerts, so the statement about them being a problem sounds like a misconfiguration or a failure to understand how to set up email filters.


> there is nothing wrong with email alerts

You're wrong.

As a sysadmin, I typically receive something on the order of 1,000 to 10,000 emails daily (the specifics vary by the system(s) I'm admining). Staying on top of my email stream is a significant part of my job, both in not ignoring critical messages which have been lost, misfiled, or spam-filtered, and in not getting bogged down in verbose messages which convey no real information.

Alerts which tell me nothing have a negative value: they obscure real information, they don't convey useful information, and each person who comes on to the team has to learn that "oh, those emails you ignore", write rules to filter or dump them, etc.

Worse: if the alerts might contain useful information, that fact has to be teased out of them.

The problem with emails such as that is that they're logging or reporting data. They should be logged, not emailed, and with appropriate severity (info, warning, error, critical). Log analysis tools can be used to search for and report on issues from there.

As I said: in a mature environment, much of my work goes into removing alerts, alert emails, etc., which are well-intentioned but ultimately useless.


>As a sysadmin, I typically receive something on the order of 1,000 to 10,000 emails daily

Sorry, but you're not a very good sysadmin then. You have chosen poor tools or do not understand how to distill the information. Knowing that, I can see why you think email alerts don't work. They are effectively broken FOR YOU.


And you don't think vendors have a responsibility to reflect upon the way they do alerts and/or service monitoring?

It's usually not the system administrators who get to decide what the Corporate Overlords purchase or who they do business with. So I think it's pretty unfair to blame the admins for "choosing poor tools".


The point being: delegating prioritization and categorization to a human in real time is lazy and dangerous. As much as possible, humans should only receive notifications when something requires action or is too complex to determine programmatically.


> Some standard IM protocol that federated across your whole network (and an external point of control), could be reached from anywhere, and had plenty of support on all kinds of computers would be better, but it does not exist

I would recommend an SMS sent via GSM modem for out-of-band emergency notifications.


Or a service like Twilio, with an HTTP API for this.
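
A minimal sketch of the Twilio route, assuming you already have an account SID, auth token, and a provisioned number (all the values here are placeholders):

  # Send an out-of-band alert SMS via Twilio's REST API
  curl -s -X POST "https://api.twilio.com/2010-04-01/Accounts/$TWILIO_SID/Messages.json" \
    -u "$TWILIO_SID:$TWILIO_TOKEN" \
    --data-urlencode "From=+15550001111" \
    --data-urlencode "To=+15552223333" \
    --data-urlencode "Body=ALERT: primary health check failing"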


Hospitals have a similar problem -- too many devices with too many alarms. As many as 10,000/day on a busy nursing floor.

NPR covered this a few days back; I've written about it at more length:

http://www.npr.org/blogs/health/2014/01/24/265702152/silenci...

http://www.reddit.com/r/dredmorbius/comments/1x0p1b/npr_sile...


"What if the email goes down? I know I'll send an email"


Keyboard missing, press F1 to continue


That's actually a case where sending a regular ping mail to several sentinel systems which report on the LACK of an email can be useful.

Reminds me of a few times the email queues got backed up to hell and beyond. Fuck you, Yahoo.


The corollary of this post is "things we've been monitoring and/or alerting on which we shouldn't have been".

Starting at a new shop, one of the first things I'll do is:

1. Set up a high-level "is the app / service / system responding sanely" check (a bare-bones version is sketched at the end of this comment) which lets me know, from the top of the stack, whether or not everything else is functioning properly.

2. Go through the various alerting and alarming systems and generally dial the alerts way back. If it's broken at the top, or if some vital resource is headed into the red, let me know. But if you're going to alert based on a cascade of prior failures (and DoS my phone, email, pager, whatever), then STFU.

In Nagios, setting relationships between services and systems, for alerting services, setting thresholds appropriately, etc., is key.

For a lot of thresholds you're going to want to find out why they were set to what they were and what historical reason there was for that. It's like the old pot roast recipe where Mom cut off the ends of the roast 'coz that's how Grandma did it. Not realizing it was because Grandma's oven was too small for a full-sized roast....

Sadly, that level of technical annotation is often lacking in shops, especially where there's been significant staff turnover through the years.

I'm also a fan of some simple system tools such as sysstat which log data that can then be graphed for visualization.
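
A bare-bones version of the check in (1), with the URL and the expected string obviously being placeholders for whatever "responding sanely" means for your stack:

  # Fails (non-zero exit) if the app isn't answering sanely end to end
  curl -fsS --max-time 10 https://example.com/health | grep -q '"status":"ok"'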


Be sure to monitor your monitoring system as well (preferably from outside your network/datacenters)! If you don't have anything else in place, you can use Pingdom to monitor one website/server for free [0].

I was off work for a few months recently (motorcycle wreck) and removed my e-mail accounts from my phone. Now, I have all my alerts go to a specific e-mail address and those are the only mails I receive on my phone. It has really helped me overcome the problem of ignoring messages.

[0]: https://www.pingdom.com/free/


We monitor outgoing smtp and http connections from anything that requires those services.

And the best general advice I have is split your alerts into "stuff that I need to know is broken" and "stuff that just helps me diagnose other problems". You don't want to be disturbing your on-call people for stuff that doesn't directly affect your service (or isn't even something you can fix).


Also: are your backups working?


We had a perfect storm of problems only 2 weeks ago.

1. A vendor tomcat application had a memory leak, consumed all the RAM on a box, and crashed with an OOM

2. The warm standby application was slightly misconfigured, and was unable to take over when the primary app crashed

3. Our Nagios was configured to email us, but something had gone wrong with ssmtp two days prior, and it was unable to contact Google Apps

3a. No one was paying any attention to our server metric graphs, and we didn't have a good enough way of saying "pay attention to these specific graphs because they are currently outside the norm"

A very embarrassing day for us that one.

We're now working on better graphing, and have set up a basic ssmtp check to SMS us if there is an issue. Monitoring is hard.


You may want to check out OpsGenie heartbeat monitoring, or essentially implement the same idea yourself. Our heartbeat monitoring expects to receive messages (via email or API) from your monitoring tools periodically and notifies you via push/SMS/phone if we don't receive one for more than 10 minutes. I think this pattern is very useful for ensuring that alert notification is actually working.


> and have set up a basic ssmtp check to SMS us if there is an issue.

And what will happen when the network (or the alert server) is down?

You must put some check outside your network, with independent infrastructure. Adding another protocol on the same network is still subject to Murphy's law.


Independent infrastructure is a good idea but not always feasible for everyone. At OpsGenie, to resolve this problem, we came up with a solution we refer to as "heartbeat monitoring". This basically allows monitoring tools to send us periodic heartbeat messages indicating that the tool is up and can reach us. If we don't receive heartbeat messages from it within 10 minutes, we generate an alert and notify the admins. It's not out-of-band management, but it does the trick to prevent situations like the one jsmeaton described.

http://support.opsgenie.com/customer/portal/articles/759603-...
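
The sending side of that pattern is tiny; something like the following in cron on the monitoring host (the URL is a placeholder for whatever heartbeat endpoint your provider or homegrown sentinel gives you):

  # /etc/cron.d/heartbeat -- prove every 5 minutes that this box and its network path are alive
  */5 * * * * root curl -fsS --max-time 30 https://heartbeat.example.com/ping >/dev/null 2>&1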


You're using icanhazip.com in production? I see from a quick Google search that Puppy Linux seems to use it in some scripts, but how reliable is it?


  $ curl -i -k -L icanhazip.com
  HTTP/1.1 200 OK
  Date: Mon, 10 Feb 2014 20:13:28 GMT
  Server: Apache
  Content-Length: 15
  Content-Type: text/plain; charset=UTF-8
  X-RTFM: Learn about this site at http://bit.ly/14DAh2o and don't abuse the service
  X-YOU-SHOULD-APPLY-FOR-A-JOB: If you're reading this, apply here: http://rackertalent.com/
  X-ICANHAZNODE: icanhazip2.nugget

Would seem only fair. :D


jsonip.com is also usable in production.


Regarding reboot monitoring: I suggest using kdump to dump the oops information and save it for later debugging and understanding of the issue. It may even be an uncorrectable memory or PCIe error you are seeing, and the info is logged in the oops but is hard to figure out otherwise. Also, if you consistently hit a single kernel bug, you may want to fix it or work around it.


Also: are your API endpoints working properly?


Can you expand on that?


Ha, I can.

Sometimes API providers change the damn response format. Or their URLs change. Or they block your IP without notifying you.


Thanks.

I was thinking of some sort of endpoint test myself; I hadn't considered the specific case of APIs.


I have gear in three different facilities and I'm typically not visiting any of them unless I'm installing or replacing hardware. Shortly after starting at $job, I realized there was no monitoring of the RAID arrays in the servers we have. That could have ended badly.
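
If it's Linux software RAID, the check can be as simple as the line below (hardware controllers need the vendor tool instead, e.g. omreport or MegaCli); mdadm also has a --monitor mode that can mail you on failure events:

  # A degraded md array shows an underscore in its member status, e.g. [U_]
  grep '\[.*_.*\]' /proc/mdstat 2>/dev/null && echo "degraded md array!"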


How oversold your VPS provider's server is: commonly blamed for slowdowns, but rarely measured.
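
One thing you can measure from inside the guest is CPU steal time, which is a decent proxy for an oversold host:

  # Watch the "st" column: consistently non-zero steal means the hypervisor
  # is handing your CPU time to other guests
  vmstat 5 3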


Between PRTG and Windows, almost all of that is handled for us. And PRTG can call OMSA by SNMP.



