Show HN: Self-hosted, open-source infrastructure monitoring and alerting

falcolas · on Jan 13, 2014

Cool project, but I can't help but wonder why people are so quick to rewrite Nagios instead of spending less time and learning about Nagios.

The DSL isn't that confusing, and there are many many options there. I've been able to set up a very comprehensive monitoring service using it, and while I too once cursed its complexity, it was worth it in the long run. Certainly didn't take as long as it would take to write a clone.

dbuxton · on Jan 13, 2014

I think the thing for us was that we couldn't understand how to make Nagios see the world in terms of services (which are sometimes a set of machines running on physically or virtualised infrastructure, but sometimes individual servers on shared hosts). I'm sure it can be done, I just couldn't wrap my head around it.

Also it was surprisingly easy and fun to write. Probably has taken longer than learning Nagios but I don't think I would have spent Christmas hacking away at that task...

kokey · on Jan 13, 2014

I've spent 3 years making Nagios 'work'. I've had to add all the things Nagios didn't have, like inventory management, graphing, combining graphs, trend detection, service discovery, etc. I also had to make it scale. This also involved writing plugins in C to conserve memory since plugs are run through forking processes. The 'community' plugins are mostly really terrible. Scaling it also running things outside of the Nagios scheduler with another script and feeding the results back to Nagios. This allowed me to collect data for 120k parameters in 5 minutes from a single monitoring server. Anyway, this is just a long way of saying writing your own thing is probably not that bad an idea compared to having to learn Nagios and then trying to hack it into the shape of an enterprise monitoring system.

falcolas · on Jan 13, 2014

Sorry, inventory management? Graphing? Service discovery? Those don't strike me as fitting into the monitoring role. It seems to me like you spent 3 years bloating Nagios into a multi-role service that it was never originally designed to fulfill, as opposed to setting up 3-4 different services which specialize in what you're looking for (which took us significantly less than 3 years).

Monitoring is looking at the services and making sure they're running within tolerances. Graphing is not useful for monitoring, its useful for postmortems and planning. Inventory management is not useful for monitoring, it's useful for building out servers. Service discovery has nothing to do with monitoring, though it's certainly be useful for automatically populating your monitoring configuration.

However, trend detection, I agree; nagios needs more of that. I should definitely get alerts when my disk space usage suddenly starts growing 4x its normal amount, or when my CPU usage spikes to 3x its normal usage.

deathcakes · on Jan 13, 2014

> Graphing is not useful for monitoring, its useful for postmortems and planning.

I'm sorry, I have to completely disagree with this point. Graphing is a fantastic way of visualising information to show trends, such as disk space growth, or memory usage, which can and have, in our company's case, led to diagnosis of problems that hadn't yet manifested in a complete crash. I realise you add in a comment saying that Nagios should be doing these things internally, which it also should, but what about situations you haven't planned for? Or intermittent events that don't look like anything individually, but when graphed over 6 months turn out to reveal the subtle failure of an aircon unit in a server room?

falcolas · on Jan 13, 2014

My apologies if I wasn't clear.

My intention was to say, with regards to graphing:

Graphing should be part of your overall tooling for monitoring boxes, but it should not be part of process monitoring. It should be a separate component (which should in turn be monitored by Nagios), for use in post-mortems and future planning.

It should not be built into Nagios, since the graph itself is completely irrelevent to whether Nagios should alert about a problem.

I also think that graphing is different from heueristic rate monitoring; graphs are for humans to spot trends/correlations. Alerting on rate changes is a binary alert/don't alert.

I.E. Cacti/Graphite should be creating graphs about services for human consumption, Nagios should be watching for problems with services.

mlieberman85 · on Jan 13, 2014

If Nagios was a lightweight monitoring solution I would agree with you. Nagios however is a large monolithic system that is hard to integrate with third party metrics solutions, inventory management, etc. Though something like inventory management isn't part of monitoring it is still part of your whole system and if your monitoring solution can't easily integrate with your inventory management solution it gets frustrating.

I also tend to think service discovery or at least a flexible API that allows other systems to create new checks is a huge part of monitoring. Nagios' configuration files are a nightmare to deal with especially in a configuration managed environment.

falcolas · on Jan 13, 2014

I think you're looking at it backwards. Nagios shouldn't integrate with your inventory management; inventory management should be configuring Nagios. Chef, Puppet and Ansible all have modules for idempotently configuring services in Nagios, making it simple to have your nagios configuration be part of your inventory management solution.

Ideally, this also resolves your configuration management problem as well. If it does not, there are tools which can probe a running Nagios state file, and provide you feedback on what's being monitored, and its state. The python library nagparser is one that I've used in the past to probe Nagios status for a status aggregation tool.

jamespo · on Jan 13, 2014

Running scripts outside the nagios scheduler is straightforward with passive checks.

Also nagios 4 has changed the way forking checks work - http://labs.nagios.com/2013/09/20/nagios-core-4-now-availabl... - I appreciate this was too late for you.

IgorPartola · on Jan 13, 2014

I hear your pain. I once had to spend quite a while learning Nagios just to do some simple things. However, Nagios does have the capability do handle the very simple case you are describing. You have hosts [1] and you have services [2] that run on hosts. The relationships are many to many: some hosts run some services, while others run others, or none at all. When telling Nagios about all this, you simply define how to find a host, how to find a service, and finally which hosts to check for which service. This is logical and straightforward in my mind.

Now, the problem is that to get to the links I cited, I had to click through 5 links, and I knew exactly what I was searching for. Nagios's biggest problem is that their documentation looks archaic. Updating that would give the project so much more appeal.

[1] http://nagios.sourceforge.net/docs/nagioscore/4/en/hostcheck...

[2] http://nagios.sourceforge.net/docs/nagioscore/4/en/servicech...

stevekemp · on Jan 13, 2014

I wrote a distributed monitoring system a while back, and it has to be said that when you're testing "ping + ssh + ftp + http" on a few thousand servers nagios won't alert before your clients call.

falcolas · on Jan 13, 2014

Few thousand? Impressive. FWIW, you can create a stack of Nagios monitors, where each one reports to a higher instance using NRPE or some such, instead of relegating it to a single point of failure (or in your instance, slowness).

Anything that has to ssh, ftp and http to a few thousand servers is going to be slow.

stevekemp · on Jan 14, 2014

Imagine the internal monitoring that a hosting company might need. They'd have to manage:

* All the (virtual machine) host boxes.

* All the routers, switchers, and firewalls.

* Status-checks, on hosted sites, etc.

In the end I designed and implemented a system which was capable of running all the tests in around 90 seconds, by virtue of being distributed. One host does all the parsing and such like, and N-other hosts could pull out tests to execute. (As it happened we ran everything on a single box, but it was designed to be distributed, it just transpired that having 6-10 worker process pulling jobs from the queue to execute was good enough.)

Introduction:

http://blog.bytemark.co.uk/2012/12/19/custodian-a-network-mo...

Code:

https://projects.bytemark.co.uk/projects/custodian

michaelmior · on Jan 13, 2014

Agreed. I think one of the best things about Nagios is all the readily available plugins. I suspect more than enough to cover the current functionality of Cabot. I think Icinga[1] (a Nagios fork) also addresses some of the common issues people have with Nagios.

[1] https://www.icinga.org/

spantaleev · on Jan 13, 2014

For something that can help make Nagios management easier, you could use something like nagadmin: https://github.com/devture/nagadmin

It's both a web-configurator and a replacement for the default interface - no need to learn much about Nagios (unless you want to) and should be satisfactory for a simple use-case. Certainly not a replacement for the more complicated and scalable configuration/frontend solutions out there, but it does serve its purpose.

jorts · on Jan 13, 2014

Opsview, Icinga, Groundworks, Nagios XI, etc... also put a decent UI on top of Nagios Core.

beagle3 · on Jan 13, 2014

Can Cabot handle "multihomed" services?

I have services running remotely, that are accessible through a variety of communication methods (VPNs, SSH tunnels, 3G modems, and occasionally just straight up IP!) and it's been awhile, but when I set up nagios, it seemed there was no way to tell it: "Alert if this host is not reachable in any of these 50 ways" - it monitors all 50, and alerts me whenever each one of them changes state (which is quite often).

What I want to say is: Here are all the ways I can reach server "pinky", I want a warning if there are less than 3 ways for more than an hour, and an alert if non of the way works for more than 3 minutes.

Can cabot do that?

(for that matter, can any other monitor do that?)

dbuxton · on Jan 13, 2014

This sort of service-oriented monitoring is very close to the pain point that caused us to create Cabot.

In our case, it was more that there were multiple things we could monitor for that potentially indicated a problem with a service but it was hard, using Nagios or whatever, to tie those individual indicators back to a single service going haywire at 3am, especially if all 50 of them blow up at once.

We don't have the precise problem that you describe though - for us the ability to monitor the number of data series in a graphite collection (so that if a server disappears we can notify, and then if another drops off we can alert) is sufficient. Cabot can use this and the ability to tolerate a number of failures to give behaviour very close to what you talk about. However I don't know if Cabot would support the kinds of checks that you're carrying out out-of-the-box, you'd probably have to extend.

mhurron · on Jan 13, 2014

I think Zabbix can do what you want with trigger dependencies.

https://www.zabbix.com/documentation/1.8/manual/config/host_...

Though I'm not sure about the "warning if there are less than 3 ways for more than an hour" requirement.

sehrope · on Jan 13, 2014

I'm not sure about Cabot (first time seeing it is today) but I've been designing a system that supports this type of monitoring. Shoot me an email (contact in profile) if you'd like to be contacted once it's up.

Wouter33 · on Jan 13, 2014

Great! Have been looking for something like this lately! First impression is good. But what i can't seem to find is a http notification. I would like a http alert if something is down for a x number of consecutive pings. This way i can do some API calls or implement my own alerting.

AJP1 · on Jan 13, 2014

Is that what you need? http://cabotapp.com/use/http-checks.html or open an issue on: https://github.com/arachnys/cabot

sehrope · on Jan 13, 2014

I think the GP is referring to web hooks[1]. When an alert gets triggered, rather than sending out an email/sms/smoke-signal you'd do an HTTP POST to a URL. The receiver of the HTTP POST could then add there own custom alerting or hook it into an existing process.

[1]: http://en.wikipedia.org/wiki/Web_hook

Wouter33 · on Jan 13, 2014

Yes, web hooks indeed.

Wouter33 · on Jan 13, 2014

That's monitoring. Alerting seems only possible via Email, Hipchat, SMS and Telephone.

dbuxton · on Jan 13, 2014

Primary author here. Please add an issue at https://github.com/arachnys/cabot/issues/new and we'll look into adding. (Essentially that's all Hipchat is).

AJP1 · on Jan 13, 2014

Understood. No it doesn't at the moment, would be trivial to add.

wut42 · on Jan 13, 2014

I deployed it at work last Friday and i'm in love with it already. :) Good job guys!

dbuxton · on Jan 13, 2014

Glad to hear it, thanks for the positive feedback!

chanux · on Jan 13, 2014

Salmon is a monitoring/alerting system built on Django.

https://github.com/lincolnloop/salmon

BaconJuice · on Jan 13, 2014

Can this be deployed on a OSX? If so can you provide instructions on how to deploy this on a local OSX server? Thank you.

dbuxton · on Jan 13, 2014

Don't see why not, but all the upstart stuff is platform-specific.

I'd suggest you spin up a clean Ubuntu instance in Virtualbox and just run the fab deploy script against that locally... It won't be fast, it won't be pretty, but I'm pretty sure it will work.

wut42 · on Jan 13, 2014

I've installed my instance on FreeBSD, i'm pretty sure you can install it without issues on an OS X Server. Read the install scripts they give and adapt it for OS X.

zapt02 · on Jan 13, 2014

Cool!