At our company we do devops, and the main thing that frustrates me is the total violation of DRY that occurs. For us it's not just monitoring that sucks: it's also application configuration, failover, and documentation.
For example, email. Our email sending library requires various references to internal IPs, external IPs, and/or domains it can send to/from/through. Then our cfengine files have the exact same references in multiple locations: sendmail settings, DNS settings, and network settings. Then our nagios alert system, again, has references to many of them. Oh, and then we need a wiki with a human-readable version so nobody has to dig through these scripts. And that's just for email. Never mind databases, caches, app servers, etc.
We have health checks within custom developed applications for failover, and Nagios health checks. Sometimes Nagios will rely on the app health checks, but the app never relies on ops related monitoring. It just seems like there's quite a bit of effort being wasted here as we're doing the same thing twice in two different places.
The frustrating part is when I change one, I have to go through and change them all. It seems like there could be a much better system than this. I would imagine many come up with custom scripts for all of this, but there has to be a better way.
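To make the duplication concrete, here is a minimal sketch of the "define it once, generate the rest" idea. Everything in it is an assumption for illustration: the `EMAIL` facts, the addresses, and the function names are hypothetical, and the output formats are simplified stand-ins for the real library config, a nagios host definition, and a wiki page.

```python
# Hypothetical single source of truth for the email setup; all values are made up.
EMAIL = {
    "relay_name": "mail-relay",
    "relay_host": "10.0.0.25",
    "external_ip": "203.0.113.10",
    "sender_domain": "example.com",
}


def app_config(facts):
    """Settings the email-sending library would read (hypothetical keys)."""
    return {
        "smtp_host": facts["relay_host"],
        "from_domain": facts["sender_domain"],
    }


def nagios_host(facts):
    """Render a simplified nagios host definition as text."""
    return (
        "define host {\n"
        f"    host_name  {facts['relay_name']}\n"
        f"    address    {facts['relay_host']}\n"
        "}\n"
    )


def wiki_page(facts):
    """Render a plain-text page listing every fact, one per line."""
    lines = ["Email infrastructure", ""]
    lines += [f"- {key}: {value}" for key, value in sorted(facts.items())]
    return "\n".join(lines)


if __name__ == "__main__":
    print(app_config(EMAIL))
    print(nagios_host(EMAIL))
    print(wiki_page(EMAIL))
```

With something like this, changing the relay address means editing one dictionary and regenerating, rather than hunting through four systems. The same pattern could live inside cfengine templates instead of a standalone script.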
I'd love to hear more about this. What exactly is stopping you (or, more precisely, your company) from configuring your email sending library and nagios through cfengine, and generating docs from the cfengine scripts? (XML is supposed to help with that.)
Also, I don't see the issue with nagios asking apps "you still breathing?". And why does a library know about IPs (if it's strictly a library)?
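One way to avoid writing the same check twice is to keep a single check function and let both the failover logic and the monitoring side call it. The sketch below assumes a plain TCP connect is an adequate health signal, which is a simplification; the function names are hypothetical. The exit codes 0 and 2 are the standard nagios plugin codes for OK and CRITICAL.

```python
import socket

# Standard nagios plugin exit codes.
OK, CRITICAL = 0, 2


def smtp_reachable(host, port=25, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def nagios_result(healthy, name):
    """Translate a boolean health result into (exit_code, status_line)."""
    if healthy:
        return OK, f"{name} OK"
    return CRITICAL, f"{name} CRITICAL"


def pick_backend(backends, check):
    """Failover: return the first backend that passes the check, or None."""
    for backend in backends:
        if check(backend):
            return backend
    return None
```

A nagios plugin script would print the status line and exit with the code from `nagios_result(smtp_reachable(host), "SMTP")`, while the application would call `pick_backend(hosts, smtp_reachable)`, so both sides share one definition of "healthy".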
I've been working on this problem, in various forms, for the last ten years. It's pushed me to write https://github.com/aphyr/reimann, a network event stream processor for centralized graphing, metrics, dashboards, alerting, and analytics. It makes it easy to forward events to other monitoring systems, too. I'm using it in production at http://showyou.com now; after I'm comfortable with its stability under load, I'll make an official release--probably within a month.
[edit] I can also attest to the awesomeness of Boundary's platform. These guys have a killer UI, excellent reliability, and collect important, typically invisible data. I started work on Reimann because I needed to handle more than just network traffic--from the rate of feed item fanouts to a breakdown of memory consumption across all hosts. I also have different dashboard requirements. Regardless, I'm excited to see Boundary's take on the problem.
Interesting article with several items that are relevant to me. I'm curious whether this service will have any open source components and whether it can be self-hosted.
Boundary already has a bunch of open source stuff, with more coming soon: https://github.com/boundary. Our first offering will be as a service. If you want to talk more, ping me at cliff@boundary.com.
Not strictly related to this problem, but if you are looking for kickass application monitoring that tracks your application's performance from the user's browser to the function call on your server, take a look at http://newrelic.com/
disclosure: I'm just a really really happy customer
We also use New Relic and are extremely pleased with it. However, if we agree that all five of the OP's points are legitimate identifiers of a well-rounded monitoring system, New Relic only meets three of the five: deep integration, high resolution, and dynamic configuration.
New Relic is indeed an excellent monitoring tool, but it's not perfect. Alerts lag behind by anywhere from two to five minutes, and there are no context-sensitive alerts... at least not that I'm aware of.