

Why I'll be letting Nagios live on - js2
https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/

======
jccooper
Quick summary, since the site is down:

In response (obviously) to [http://www.slideshare.net/superdupersheep/stop-using-nagios-...](http://www.slideshare.net/superdupersheep/stop-using-nagios-so-it-can-die-peacefully) (also available through HN).

Author uses Nagios at Etsy, with 10,000 checks (mostly in the 2 minute range),
and it seems to work well for them with some minor tweaking. They have a
plugin that provides a REST API. And he thinks the rest of the complaints are
about backend stuff that he doesn't deal with much or at all (configs, wire
formats, et al.)

He considers Nagios to be simpler than the proposed Sensu, and prefers "Unix
style" applications rather than monolithic ones, where Nagios is certainly on
the simpler side.

So he will continue using Nagios because it works well for his use, and good
luck making something better.

~~~
toomuchtodo
> So he will continue using Nagios because it works well for his use, and good
> luck making something better.

Ops guy here. If something better than Nagios/Zabbix came along, ops people
would use it. Nagios gets used because 1) people know it and 2) it's pretty
easy to set up.

Is it going to scale to Netflix or AWS scale organizations? Probably not. But
at that scale, I presume you have buckets of money to throw at problems (i.e.
build a custom monitoring platform).

~~~
nobodysfool
That's like saying 'If something better than PHP came along, programmers would
use it'. Something better did come along, but that doesn't change the fact
that PHP is still popular.

~~~
akerl_
It's not like saying that at all. There isn't a plethora of full-featured
monitoring/alerting tools out there.

------
rjzzleep
I love the diagram making fun of overly complex systems:

[https://pbs.twimg.com/media/BNELF1GCUAExynU.png:large](https://pbs.twimg.com/media/BNELF1GCUAExynU.png:large)

For reference, the architectural diagram of Sensu:

[http://portertech.ca/images/2011-11-01/sensu-diagram.png](http://portertech.ca/images/2011-11-01/sensu-diagram.png)

------
beaker52
Hopefully Nagios is showing him some red metrics right about now.

~~~
yeukhon
The first bullet point: _“Doesn’t scale at all.”_ I wonder how much traffic
is being sent to his blog right now. In just 1 hour we _DDoS_-ed his blog with
HNers visiting it.

~~~
toomuchtodo
I'm going to buy pg and company a beer and get code integrated into HackerNews
that turns all submitted links into Coral cache links. It's 2014 and the
Slashdot effect is alive and well :(

~~~
yeukhon
That's actually a great idea! :) I hadn't thought of that. Two things come to
mind: (1) rewrite links but provide the original link for copy-paste (think
about XSS from clicking a link in a tweet; people want to look at the link
before clicking it), and (2) HN's rewritten/shortened link must do dynamic
cache URL retrieval, not static. I think Google's cache link is different, as
the cache is updated.

~~~
toomuchtodo
Thanks! I agree with both points; I'd even suggest adding a link to the
Internet Archive's copy, but baby steps.
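
A rough sketch of what that link rewriting could look like, keeping the original URL alongside the cached variants. The function name `cached_links` is made up for illustration, and the `.nyud.net` suffix is an assumption based on how CoralCDN was historically addressed (appended to the hostname, served over plain HTTP); this is not anything HN actually runs.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: CoralCDN's historical hostname suffix.
CORAL_SUFFIX = ".nyud.net"


def cached_links(url: str) -> dict:
    """Return the original URL plus Coral-style and Internet Archive variants."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Coral-style rewrite: same path and query, suffixed hostname, plain HTTP.
    coral = urlunsplit(("http", host + CORAL_SUFFIX, parts.path, parts.query, ""))
    return {
        "original": url,  # kept so readers can inspect or copy the real target
        "coral": coral,
        "archive": "https://web.archive.org/web/" + url,
    }
```

Building the links at render time rather than storing them would cover the "dynamic, not static" point, since the cache target is recomputed from the original URL on every page view.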

------
rrreese
Google cache:
[http://webcache.googleusercontent.com/search?q=cache%3Ahttps...](http://webcache.googleusercontent.com/search?q=cache%3Ahttps%3A%2F%2Flaur.ie%2Fblog%2F2014%2F02%2Fwhy-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much%2F&oq=cache%3Ahttps%3A%2F%2Flaur.ie%2Fblog%2F2014%2F02%2Fwhy-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much%2F&aqs=chrome..69i57j69i58.1552j0j4&sourceid=chrome&espv=210&es_sm=93&ie=UTF-8)

------
lozzd
Well, this is embarrassing.

My colo server died earlier today (completely unrelated to this) and didn't
come back up because GRUB hadn't reinstalled properly on a replaced software
RAID disk.

But yes, Nagios did tell me it was down :)

------
jcmcken
> ...just the architectural diagram of how it works scares the shit out of me.
> When you need 7 arrow colours to describe where data is going in a
> monitoring system, I’m starting to fear it slightly.

This strikes me as a pretty lazy argument. Admittedly, that diagram is not the
best. But that you can separate the components of Sensu isn't a bug, it's a
feature. No one says you have to do it this way. In fact, you can exactly
replicate the architecture of Nagios by having a single server. The point is
you have choices, choices which are dependent not on the monitoring software
(which is very lightweight by design in the case of Sensu), but on the other
open source software it relies on (Redis, RMQ).

So I would argue that the "server instance" scope is way too broad a category
to measure complexity. If you attempted to diagram the workflow Nagios uses
(ignoring for a moment what server instance each component is on), you would
come up with something equally bad (if not worse). That's if you even
understood anything at all about how Nagios works to know what to diagram.

So let's replace one crude measure with another. The Sensu core repo is ~3MB
total. Nagios core is about 10x that (30MB). NRPE is about 1MB all by itself.
Mod_gearman (to pick out an add-on) comes in at a whopping 6MB. Suffice it to
say, for something that's basically a glorified exit code validator, this
seems like a lot of complexity. Sure, Nagios has a lot of features that Sensu
doesn't have, and that accounts for some of this. But there's a lot to be said
for modular systems vs. monolithic ones.
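
To make the "glorified exit code validator" remark concrete, here is a minimal sketch of the core loop: run a plugin and map its exit status to a state. The 0/1/2/3 mapping follows the standard Nagios plugin convention; the `run_check` helper and its timeout default are hypothetical, and neither Nagios nor Sensu is implemented this way line for line.

```python
import subprocess

# Standard Nagios plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(argv, timeout=10):
    """Run a plugin command and return (state, first line of its output)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A plugin that cannot be started or hangs is treated as UNKNOWN.
        return "UNKNOWN", str(exc)
    # Any exit code outside 0-3 is also reported as UNKNOWN.
    state = STATES.get(proc.returncode, "UNKNOWN")
    lines = proc.stdout.strip().splitlines()
    return state, lines[0] if lines else ""
```

Everything else (scheduling, notifications, transport to remote hosts) is layered around this step, which is where the size differences between the projects come from.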

------
lnanek2
Is this some sort of joke? Maybe it is intentionally down and alerting him? :)

------
nasalgoat
His point about the NRPE config is spot-on - just put all your checks in one
file. Done.

Not sure why people are making Nagios configuration more complicated than it
needs to be.
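
For what it's worth, a small sketch of the "one file" approach: keep the checks in a single mapping and render them as NRPE `command[name]=...` lines. The check names, plugin paths, and thresholds below are assumptions for illustration (paths vary by distro), and `render_nrpe_commands` is a made-up helper.

```python
# Hypothetical checks; plugin paths and thresholds depend on your install.
CHECKS = {
    "check_load": "/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6",
    "check_disk_root": "/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /",
}


def render_nrpe_commands(checks):
    """Render NRPE command definitions, one per line, sorted by check name."""
    lines = [f"command[{name}]={cmd}" for name, cmd in sorted(checks.items())]
    return "\n".join(lines) + "\n"
```

The output of `render_nrpe_commands(CHECKS)` can be written to one file and pushed to every host, so there is a single place to look when a check changes.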

------
Torn
Site down - anyone got a mirror?

~~~
mzs
[http://webcache.googleusercontent.com/search?q=cache:mMFIAge...](http://webcache.googleusercontent.com/search?q=cache:mMFIAgeTTOQJ:https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/&strip=1)

