
Show HN: Self-hosted, open-source infrastructure monitoring and alerting - dbuxton
http://cabotapp.com
======
falcolas
Cool project, but I can't help but wonder why people are so quick to rewrite
Nagios instead of spending less time just learning Nagios.

The DSL isn't that confusing, and there are many many options there. I've been
able to set up a very comprehensive monitoring service using it, and while I
too once cursed its complexity, it was worth it in the long run. Certainly
didn't take as long as it would take to write a clone.

~~~
dbuxton
I think the thing for us was that we couldn't understand how to make Nagios
see the world in terms of services (which are sometimes a set of machines
running on physical or virtualised infrastructure, but sometimes individual
servers on shared hosts). I'm sure it can be done, I just couldn't wrap my
head around it.

Also it was surprisingly easy and fun to write. Probably has taken longer than
learning Nagios but I don't think I would have spent Christmas hacking away at
that task...

~~~
kokey
I've spent 3 years making Nagios 'work'. I've had to add all the things Nagios
didn't have, like inventory management, graphing, combining graphs, trend
detection, service discovery, etc. I also had to make it scale. This also
involved writing plugins in C to conserve memory, since plugins are run by
forking processes. The 'community' plugins are mostly really terrible. Scaling
it also meant running things outside of the Nagios scheduler with another script and
feeding the results back to Nagios. This allowed me to collect data for 120k
parameters in 5 minutes from a single monitoring server. Anyway, this is just
a long way of saying writing your own thing is probably not that bad an idea
compared to having to learn Nagios and then trying to hack it into the shape
of an enterprise monitoring system.
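
(Editorial sketch: the comment does not say how results were fed back, but one documented route is Nagios' passive check interface, where an external script writes `PROCESS_SERVICE_CHECK_RESULT` lines to the external command file. The host name, service name and command-file path below are hypothetical; the real path is whatever `command_file` is set to in your nagios.cfg.)

```python
import time


def passive_result(host, service, code, output, now=None):
    """Format one passive service check result for Nagios.

    code follows the plugin convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
    """
    ts = int(time.time() if now is None else now)
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{output}\n"


def submit(lines, command_file):
    """Write pre-formatted result lines to the Nagios external command file.

    command_file is an assumption: pass the path configured in nagios.cfg.
    """
    with open(command_file, "w") as fh:
        fh.writelines(lines)


# Hypothetical host and service names, for illustration only.
line = passive_result("web01", "disk", 0, "OK - 42% used", now=1700000000)
```

Batching many results into one write like this is what lets a separate collector do the heavy lifting while Nagios only handles state and alerting.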

~~~
falcolas
Sorry, inventory management? Graphing? Service discovery? Those don't strike
me as fitting into the monitoring role. It seems to me like you spent 3 years
bloating Nagios into a multi-role service that it was never originally
designed to fulfill, as opposed to setting up 3-4 different services which
specialize in what you're looking for (which took us significantly less than 3
years).

Monitoring is looking at the services and making sure they're running within
tolerances. Graphing is not useful for monitoring, it's useful for postmortems
and planning. Inventory management is not useful for monitoring, it's useful
for building out servers. Service discovery has nothing to do with monitoring,
though it would certainly be useful for automatically populating your monitoring
configuration.

However, trend detection, I agree; Nagios needs more of that. I should
definitely get alerts when my disk space usage suddenly starts growing 4x its
normal amount, or when my CPU usage spikes to 3x its normal usage.
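
(Editorial sketch of the kind of rate-based alert described above, not something Nagios ships: compare the latest growth between two samples against the average growth over a preceding window. The function name, window size and factor are assumptions chosen for illustration.)

```python
from statistics import mean


def rate_spike(samples, window=6, factor=4.0):
    """Return True if the latest growth exceeds `factor` times the average
    growth over the `window` intervals before it."""
    if len(samples) < window + 2:
        return False  # not enough history to form a baseline
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    baseline = mean(deltas[-(window + 1):-1])  # excludes the latest interval
    latest = deltas[-1]
    # Only meaningful when usage is normally growing; flat or shrinking
    # baselines are ignored in this simple version.
    return baseline > 0 and latest > factor * baseline


steady = [100, 101, 102, 103, 104, 105, 106, 107]
sudden = [100, 101, 102, 103, 104, 105, 106, 111]
```

Here `rate_spike(steady)` stays quiet, while `rate_spike(sudden)` fires because the last interval grew by 5 against a baseline of 1 per interval.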

~~~
deathcakes
> Graphing is not useful for monitoring, it's useful for postmortems and
> planning.

I'm sorry, I have to completely disagree with this point. Graphing is a
fantastic way of visualising information to show trends, such as disk space
growth, or memory usage, which can lead, and in our company's case has led, to
the diagnosis of problems that hadn't yet manifested in a complete crash. I
realise you add in a comment saying that Nagios should be doing these things
internally, which it also should, but what about situations you haven't
planned for? Or intermittent events that don't look like anything
individually, but when graphed over 6 months turn out to reveal the subtle
failure of an aircon unit in a server room?

~~~
falcolas
My apologies if I wasn't clear.

My intention was to say, with regards to graphing:

Graphing should be part of your overall tooling for monitoring boxes, but it
should not be part of process monitoring. It should be a separate component
(which should in turn be monitored by Nagios), for use in post-mortems and
future planning.

It should not be built into Nagios, since the graph itself is completely
irrelevant to whether Nagios should alert about a problem.

I also think that graphing is different from heuristic rate monitoring;
graphs are for humans to spot trends/correlations. Alerting on rate changes is
a binary alert/don't alert.

I.e., Cacti/Graphite should be creating graphs about services for human
consumption, while Nagios should be watching for problems with services.

------
beagle3
Can Cabot handle "multihomed" services?

I have services running remotely, that are accessible through a variety of
communication methods (VPNs, SSH tunnels, 3G modems, and occasionally just
straight up IP!) and it's been a while, but when I set up Nagios, it seemed
there was no way to tell it: "Alert if this host is not reachable in any of
these 50 ways" - it monitors all 50, and alerts me whenever any one of them
changes state (which is quite often).

What I want to say is: Here are all the ways I can reach server "pinky", I
want a warning if there are fewer than 3 ways for more than an hour, and an
alert if none of the ways works for more than 3 minutes.

Can Cabot do that?

(for that matter, can any other monitor do that?)
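
(Editorial sketch of the rule described above, independent of any particular monitor: feed it the number of working paths after each polling round and it applies the two thresholds. The class and field names are hypothetical; the numbers are the ones from the comment.)

```python
from dataclasses import dataclass
from typing import Optional

OK, WARNING, CRITICAL = "OK", "WARNING", "CRITICAL"


@dataclass
class PathQuorum:
    warn_below: int = 3         # warn when fewer than this many paths work...
    warn_after: float = 3600.0  # ...for longer than this many seconds
    crit_after: float = 180.0   # alert when no path works for this long
    _degraded_since: Optional[float] = None
    _down_since: Optional[float] = None

    def update(self, paths_up: int, now: float) -> str:
        # Remember when each condition started; forget it once it clears.
        if paths_up < self.warn_below:
            if self._degraded_since is None:
                self._degraded_since = now
        else:
            self._degraded_since = None
        if paths_up == 0:
            if self._down_since is None:
                self._down_since = now
        else:
            self._down_since = None

        if self._down_since is not None and now - self._down_since > self.crit_after:
            return CRITICAL
        if self._degraded_since is not None and now - self._degraded_since > self.warn_after:
            return WARNING
        return OK
```

Individual paths can flap as often as they like; only the aggregate state, held for long enough, produces a notification.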

~~~
dbuxton
This sort of service-oriented monitoring is very close to the pain point that
caused us to create Cabot.

In our case, it was more that there were multiple things we could monitor for
that _potentially_ indicated a problem with a service but it was hard, using
Nagios or whatever, to tie those individual indicators back to a single
service going haywire at 3am, especially if all 50 of them blow up at once.

We don't have the precise problem that you describe though - for us the
ability to monitor the number of data series in a graphite collection (so that
if a server disappears we can notify, and then if another drops off we can
alert) is sufficient. Cabot can use this and the ability to tolerate a number
of failures to give behaviour very close to what you talk about. However I
don't know if Cabot would support the kinds of checks that you're carrying out
out of the box; you'd probably have to extend it.
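
(Editorial sketch of the series-counting idea, not Cabot's actual code: it takes the already-parsed JSON from a Graphite render request with `format=json`, which is a list of series each holding `[value, timestamp]` datapoints, and counts the series that reported anything. The function names, the `expected` count and the `tolerate` parameter are assumptions.)

```python
def active_series(render_json):
    """Count series that reported at least one non-null value."""
    return sum(
        1
        for series in render_json
        if any(value is not None for value, _ts in series["datapoints"])
    )


def series_status(render_json, expected, tolerate=1):
    """Notify when up to `tolerate` series are missing, alert beyond that."""
    missing = expected - active_series(render_json)
    if missing <= 0:
        return "OK"
    return "WARNING" if missing <= tolerate else "CRITICAL"


# Hypothetical data: three servers expected, one has gone silent.
sample = [
    {"target": "servers.a.load", "datapoints": [[0.4, 1700000000], [0.5, 1700000060]]},
    {"target": "servers.b.load", "datapoints": [[None, 1700000000], [0.7, 1700000060]]},
    {"target": "servers.c.load", "datapoints": [[None, 1700000000], [None, 1700000060]]},
]
```

With the sample above, `series_status(sample, expected=3)` gives a warning; a second silent server would push it to critical.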

------
Wouter33
Great! Have been looking for something like this lately! First impression is
good. But what I can't seem to find is an HTTP notification. I would like an
HTTP alert if something is down for x consecutive pings. This way
I can do some API calls or implement my own alerting.
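
(Editorial sketch of what is being asked for, not an existing Cabot feature: count consecutive failures and POST a small JSON body to a URL once the count reaches a threshold. The class name, payload fields and URL are hypothetical, and the `send` hook exists only so the logic can be exercised without a network.)

```python
import json
import urllib.request


class ConsecutiveFailureNotifier:
    def __init__(self, url, threshold=3, send=None):
        self.url = url
        self.threshold = threshold
        self.failures = 0
        self._send = send or self._post  # injectable for testing

    def _post(self, payload):
        # Plain standard-library HTTP POST with a JSON body.
        req = urllib.request.Request(
            self.url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status

    def record(self, check_name, ok):
        """Record one check result; return True if a notification was sent."""
        if ok:
            self.failures = 0
            return False
        self.failures += 1
        # Fire exactly once, when the threshold is first reached.
        if self.failures == self.threshold:
            self._send({"check": check_name, "failures": self.failures})
            return True
        return False
```

The receiving endpoint can then make whatever API calls it likes, which keeps custom alerting logic out of the monitor itself.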

~~~
AJP1
Is that what you need?
[http://cabotapp.com/use/http-checks.html](http://cabotapp.com/use/http-checks.html)
or open an issue on:
[https://github.com/arachnys/cabot](https://github.com/arachnys/cabot)

~~~
Wouter33
That's monitoring. Alerting seems only possible via Email, Hipchat, SMS and
Telephone.

~~~
dbuxton
Primary author here. Please add an issue at
[https://github.com/arachnys/cabot/issues/new](https://github.com/arachnys/cabot/issues/new)
and we'll look into adding it. (Essentially that's all Hipchat is).

------
wut42
I deployed it at work last Friday and I'm in love with it already. :) Good job
guys!

~~~
dbuxton
Glad to hear it, thanks for the positive feedback!

------
chanux
Salmon is a monitoring/alerting system built on Django.

[https://github.com/lincolnloop/salmon](https://github.com/lincolnloop/salmon)

------
BaconJuice
Can this be deployed on OS X? If so, can you provide instructions on how to
deploy this on a local OS X server? Thank you.

~~~
dbuxton
Don't see why not, but all the upstart stuff is platform-specific.

I'd suggest you spin up a clean Ubuntu instance in Virtualbox and just run the
fab deploy script against that locally... It won't be fast, it won't be
pretty, but I'm pretty sure it will work.

------
zapt02
Cool!

