
Stop using Nagios (so it can die peacefully) - erbdex
http://www.slideshare.net/superdupersheep/stop-using-nagios-so-it-can-die-peacefully
======
markdennehy
So... stop using a debugged and stable tool whose limitations and problems are
well-known and understood and replace it with six bits of software duct-taped
together, two of which aren't working yet (if they even exist), without any
idea of how they interact when they hit edge cases.

I mean, "don't use X, use ShinyX instead" is one thing (and most of the time
it's a bad thing but it does occasionally turn up good ideas), but this is
just So Much Worse...

~~~
seiji
You're presenting the MySQL argument. "Why should we switch since we know it
fails in exactly these 1,000 different ways and we can fix these problems?
Using something better has unknown failure scenarios!"

Have you ever been woken up by a nagios page that automatically cleared after
five minutes because the incoming queue was delayed past the alert interval?

Have you ever had your browser crash because you click on the wrong thing in
the designed-in-1996-and-never-updated nagios interface and had your browser
crash because it dumps 500MB of logs to your screen?

Have you ever had services wake you up with alert then clear then alert then
clear again because some new intern configured a new monitor but didn't set up
alerting correctly (because lol, they don't get paged, so who gives a flip if
they copied and pasted the wrong template config, as is standard practice)?

Have you had to hire "nagios consultants" to figure out how to scale out your
busted monitoring infrastructure because nagios was designed to run on a
single core Pentium 90?

Being pro-nagios is like being pro-Russia, pro-North Korea, and pro-Rap Genius
while arguing "but at least we know how bad they are and can keep them in
line."

~~~
blueskin_
>Have you ever been woken up by a nagios page that automatically cleared after
five minutes because the incoming queue was delayed past the alert interval?

No, because I know how to configure escalations properly.

>Have you ever had your browser crash because you click on the wrong thing in
the designed-in-1996-and-never-updated nagios interface and had your browser
crash because it dumps 500MB of logs to your screen?

Actually, no. I've had my browser crash due to AJAX crap all the time though.
Nagios' (and Icinga classic's) interface is clear, simple and logical; it's
just not 2MB of worthless javascript that wastes half my CPU time, so I can
see why it's unpopular with some user types.

>Have you ever had services wake you up with alert then clear then alert then
clear again because some new intern configured a new monitor but didn't set up
alerting correctly (because lol, they don't get paged, so who gives a flip if
they copied and pasted the wrong template config, as is standard practice)?

No, because I know how to use time periods, and escalations again.

>Have you had to hire "nagios consultants" to figure out how to scale out your
busted monitoring infrastructure because nagios was designed to run on a
single core Pentium 90?

No, because it isn't, because I know the basics of Linux performance tuning,
and because I've heard of Icinga and/or distributed Nagios/Icinga systems for
very large scale.

Your post reads like "Have you ever crashed your car head on into a concrete
wall at 70mph because it didn't brake for you?". No amount of handholding a
program can do will protect users who have no clue how to use it.

I do not by any means consider myself an expert in Nagios either - if there
was such a market for consultants as you claim, I'd likely be doing it and
therefore be rich, but in actual fact, it's a skill just about any mid-level
or better admin has.

I've inherited a Nagios config before that was a mess, that I rebuilt from
scratch in a maintainable way, as well as extended. If Nagios (or MySQL pre-
Oracle, for that matter) has a problem, it's amateurs attempting it, making a
mess, and others judging the quality of the tool on their sloppy work. Not
unique to Nagios, by any means. If there's a criticism you can level at Nagios
for that, it's the lack of documentation and examples in the config files.

I'm also not denying the existence of alternatives - OpenNMS is ok, as is
Zabbix, but both are far more limited in terms of available plugins and
extensibility, and by nature harder to extend. Munin is good for out of the
box graphing, but relatively poor for actual monitoring/alerting and hard to
write new plugins for with limited availability of additional plugins. Each
one is a standalone tool that's good for a purpose, and not some vaguely
defined set of programs, partly nonexistent, that everyone has to hack
together for themselves.

~~~
lifeisstillgood
not trolling - but how do you configure escalations properly in a circumstance
where a queue might delay longer than any arbitrary period? In short - tell me
your secrets

~~~
rhizome
Well, why are you triggering on something that appears to have a completely
random amount of delay? Either you choose your line in the sand, or you
monitor the dependency that is causing the variability.

~~~
seiji
A typical nagios alert will fire if it hasn't been updated in X seconds.
Sometimes the queue of incoming events gets backed up and nagios doesn't
receive the results of service probes until X+5 seconds or 2X seconds later
(due to internal nagios design problems, not the services actually being
delayed).

So, nagios thinks "Service Moo hasn't contacted us in 60 seconds, ALERT!" when
the update is actually in the event log, but nagios hasn't processed it yet.
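
To make the race concrete, here is a toy model of it in Python. This is only a sketch of the timing problem described above, not Nagios internals: the function name, the `grace` parameter and the idea of a single freshness check are all assumptions made for illustration.

```python
def false_alert(queue_delay: float, threshold: float = 60.0,
                grace: float = 1.0) -> bool:
    """Return True if a healthy service would trigger a staleness alert.

    Toy model (assumed, not Nagios code): the service reports exactly on
    time at t=threshold, but the monitor only *processes* that result
    queue_delay seconds later. The freshness check runs at
    t=threshold+grace and only knows about results already processed.
    """
    sent_at = threshold                  # the service is on time
    processed_at = sent_at + queue_delay
    check_time = threshold + grace       # when the freshness check runs

    # If the new result is still sitting in the queue, the monitor only
    # remembers the previous result, processed at t=0.
    last_seen = processed_at if processed_at <= check_time else 0.0
    return check_time - last_seen > threshold


if __name__ == "__main__":
    for delay in (0.0, 0.5, 5.0):
        print(f"queue delay {delay:>4}s -> false alert: {false_alert(delay)}")
```

In this model the alert depends only on how long the result waits in the queue, which is the complaint: the service itself was never late.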

~~~
blueskin_
I haven't seen this in ~1k services, but I guess it probably depends on the
spec of the monitoring system to some degree, and I realise that 1k+ hosts is
likely a different story. If you're using passive checks in any high-rate
capacity, you should be using NSCA or increasing the frequency they are read
in anyway. This is also another problem Icinga handles better - while I say
Nagios for convenience's sake, my comments here refer to Icinga (and to Nagios
XI, which is comparable but stupidly expensive).

------
wdewind
Stop using nagios, all you have to do is string together 6 random pieces of
software, 2 of which don't exist yet!

~~~
viraptor
2 of them are not available in nagios at all (graphing and anomaly detection)
and 1 already sucks completely (UI), so I'm not sure this is a good way to
look at the presentation.

~~~
carterparks
Most nagios users configure the pnp4nagios plugin for graphing

~~~
linker3000
...and some use Centreon, which bolts graphing and a better UI onto Nagios
'out of the box'

[http://www.centreon.com/](http://www.centreon.com/) ..or install via FAN:
[http://www.fullyautomatednagios.org/wordpress/](http://www.fullyautomatednagios.org/wordpress/)

~~~
kudu
It isn't very clear to me which Nagios plug-ins are considered standard, and
perhaps that sort of confusion is what creates all the FUD surrounding Nagios.

------
madaxe_again
The "main problem" with nagios is that its configuration is godawful. The
"main problem" with most nagios users is that they edit their configuration
manually.

Automate that shit. We use Nagios to monitor our infra (5000+ checks, hundreds
of hosts), and chef maintains the config. Works without a hitch - and best of
all - it's been running for years, and not once has anyone had to poke it with
a stick.

Yes, NetSaint is old, yes, the UI is worse to look at than Putin's crotch,
yes, the plugin architecture is whimsical as all shit - but... _IT WORKS AND
YOU CAN RELY ON IT_.

~~~
vacri
This is one of the weird things about puppet. The core of puppet has general
types, and software-specific stuff is in modules... except for Nagios rules,
which are core types. Struck me as odd that only Nagios gets this 'star'
treatment in the core.

[http://docs.puppetlabs.com/references/latest/type.html](http://docs.puppetlabs.com/references/latest/type.html)

~~~
barrkel
Puppet is almost entirely composed from weird things, so much so that I'm
surprised you're surprised.

rspec-puppet in particular; so much pain.

------
blueskin_
That in one line: "But I don't like it because it doesn't do things my
preferred way!".

People use Nagios because it works, and it gets everything right, including
config if you have any clue whatsoever how to set up a good object hierarchy.
The only real problem with it is maintenance, an issue which Icinga resolved
long ago.

~~~
zwily
We add/remove about 150 nodes every day to/from our monitoring system
automatically via APIs. That use case has always sucked for me with Nagios.
How would you do that?

~~~
darkandbrooding
Are you making the Nagios host pull (via NRPE) or are you asking the
individual hosts to push (via NSCA)? I am trying to solve a similar problem.
Given a dynamic population of hosts, each of which has a variable life span, I
think that asking individual hosts to query their own state and then push that
to a "monitoring receiver" is the most scalable, sustainable approach. At
least, that's the theory I'll be testing this week.

~~~
zwily
We're not actually using Nagios. We use sensu because it was designed with
this sort of dynamic environment in mind. (I'm trying to stay away from the
"C" word. :)

------
rbc
I'll only address distributed monitoring. Use NSCA instead of NRPE. That
bypasses the limitations of the Nagios active check scheduling. I have a
wrapper that I use for that:

[http://rbcarleton.com/send_nsca_service_check.shtml](http://rbcarleton.com/send_nsca_service_check.shtml)

Use some kind of automation system like CFEngine for the distributed
scheduling. Some assembly required ;)
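
For readers who have not used NSCA, a rough sketch of what such a wrapper does might look like the following. This is not the script linked above. It assumes the usual `send_nsca` input format of one tab-delimited line per result (host, service description, return code, plugin output) and that a `send_nsca` binary is on the PATH; the host and service names are made up.

```python
import subprocess

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def format_nsca_line(host: str, service: str, code: int, output: str) -> str:
    """Build one tab-delimited passive service check result."""
    if code not in (OK, WARNING, CRITICAL, UNKNOWN):
        raise ValueError(f"invalid return code: {code}")
    # Tabs or newlines inside the output would corrupt the line format.
    clean = output.replace("\t", " ").replace("\n", " ")
    return f"{host}\t{service}\t{code}\t{clean}\n"


def submit(line: str, nsca_host: str, send_nsca: str = "send_nsca") -> None:
    """Pipe the line to send_nsca (assumed installed and configured)."""
    subprocess.run([send_nsca, "-H", nsca_host],
                   input=line.encode(), check=True)


if __name__ == "__main__":
    # Hypothetical names, printed rather than sent.
    print(format_nsca_line("web1", "disk", OK, "OK - 42% used"), end="")
```

The monitored host decides when to run the check (cron, CFEngine, whatever) and only the result travels to the Nagios server, which is why the active check scheduler stops being the bottleneck.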

------
roeme
My main experience from working with nagios for somewhere around 10 years now
is that when people complain about it, they are either too lazy to read, or
too inept to understand, the documentation or the architecture. (1)

That being said, there _are_ limitations (but scaling is not one of them) to
nagios, and the configuration is definitely not something you do in cute
widdle config.yml.

Combined with the recent negative developments with the corporation behind the
Nagios trademark and the enterprise version - which the author fails to
mention, and should be even more alarming - one should at least consider using
and contributing to bareos, the (hopefully) true OSS fork of nagios (I will
for future deployments).

Oh hey, look at that, it would even pose the possibility to _improve_ the
software. (I really don't see why an almost-complete rewrite of nagios should
be necessary. Even after reading these slides(2)).

(1) That includes the author.

(2) Or rather especially after reading them.

~~~
SmokeyMcPot
> one should at least consider using and contributing to bareos, the
> (hopefully) true OSS fork of nagios

Bareos [1] is more a fork of Bacula [2], isn't it? Did you mean Icinga [3]?

[1] [http://www.bareos.org/en/](http://www.bareos.org/en/)

[2] [http://www.bacula.org/en/](http://www.bacula.org/en/)

[3] [https://www.icinga.org/](https://www.icinga.org/)

~~~
roeme
Oops, of course! (Unfortunately, the allowed edit timespan on my comment has
expired).

The irony here is that with bacula a similar thing happened.

------
lafar6502
Great idea. You can always use ugly Nagios for monitoring your great
monitoring system built on top of RabbitMQ, Ruby, Elasticsearch, Redis and few
other famous components.

------
ah-
I have wanted to try out collectd
([https://collectd.org/](https://collectd.org/)) for some time now, does
anyone have experience with it?

~~~
giulianob
I just set it up last week to push data to Graphite. It took a little bit of
time to understand how to configure it and the docs have conflicting
information in some places. Also, you will need to build it yourself or get a
PPA if you're on Ubuntu 12.04 and want to use it with Graphite since the
version that Ubuntu ships with doesn't support graphite.

I haven't tried to write plugins for it but it comes with a lot out of the box
and it's working well.

------
gk1
You should try Scalyr ([https://www.scalyr.com/](https://www.scalyr.com/)).
It's easier than juggling between six different tools. It was built by ex-
Google DevOps engineers for the same reason you made this Slideshare: the
available tools suck.

(Full disclosure: I'm working with Scalyr, but you should still try it.)

~~~
fotcorn
I don't think something like this should be hosted/SAAS. I don't want to send
my Gigabytes of logfiles (sometimes containing confidential information ...)
over the internet to some unknown entity with probably questionable security.

My small startup company already has more than 10 (virtual) servers which
would cost us 500 dollars a month to monitor which is more than the servers
themselves cost.

~~~
snewman
[Scalyr founder here]

We take security very seriously, but let me turn this around into a question:
what would it take for you to trust an external service to manage your logs?
Some of the things we're doing:

1\. SSL everywhere (including internal traffic between our backend servers).

2\. We add a tag to the raw representation of every string value, so that we
can verify that data never leaks across accounts. (This has never detected a
problem -- except in tests, because yes, we do test it.)

3\. Implementing in a "safe" language (Java), to rule out low-level buffer
management bugs.

4\. As Greg noted, we make it trivial for you to redact sensitive data before
it leaves your server.

We are sometimes asked for an on-premises installable version of our service.
We don't provide that because we're using economies of scale on the backend to
completely change the log management experience: when you give us a query,
every CPU and spindle in our entire cluster is briefly devoted to that query.
This means that you aren't limited to graphing predefined metrics; you can do
ad-hoc exploration of your entire log corpus on the fly. E.g. display a
histogram of response latencies for all requests for url XXX on server group
YYY in the last 48 hours, and expect a near-instantaneous response.

~~~
mrweasel
>what would it take for you to trust an external service to manage your logs?

I think that's a really weird question that completely fails to address the
concerns that some people might have. We do logging of sales, profit margins
and stuff like that. You can't have access to that because: "You're not us".
If you can read our data, then we're not going to use your service, and to do
anything useful with the logs you really do need read access.

Of course you might have no reason to spy on our data, but the only safety is
that you promise not to. We could separate logs for different things, so
webserver logs go to you, but email logs go to an internal system, but then
we would need two systems.

~~~
wdewind
Do you ever email about this data or put it into spreadsheets on Google's
servers?

~~~
nolok
What's your point there, that because they are already exposed in some areas
they shouldn't care being exposed in more ? Or are you looking for a "ah-ah"
moment; "you're already compromising that data" ?

Because either way, that doesn't change at all the concerns he voiced
regarding this particular service.

~~~
wdewind
I suppose both? I think it's pretty reasonable to expect a business dealing
with storing sensitive data to not look at that data, regardless of whether it
is email or logs.

------
cjlm
Good response article:

[https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-...](https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/)

------
poulsbohemian
I have consulted on operational monitoring for many years, including with a
customer that claims to have the largest deployment of Nagios anywhere (50K+
nodes). The author hits on many good points, right up until they suggest a
solution. My advice to customers has long been that you _can_ make any tool
successful, but the tools are not what really matter. Too often I've seen
customers invest $MM in tooling, and fail to understand that people and
process around that tooling is the real challenge. Too often both the
entrenched enterprise vendors AND startups in this space miss this too. When
it comes to tooling, the problem that too many startups miss is that they
repeat the patterns that the entrenched players formed decades ago, and fail
to understand that the kind of monitoring tools like Nagios and its clones
offer is but one piece of a comprehensive solution for all but the smallest of
operations.

~~~
rhizome
Hah, you sure are a consultant: your last four sentences basically repeat
themselves. ;)

------
evantahler
I'm a big fan of monit (tried and true) + m/monit (web interface + more
complex logging and analytics)

[https://mmonit.com/](https://mmonit.com/)

------
noja
So use Shinken [http://www.shinken-monitoring.org/](http://www.shinken-monitoring.org/),
the Nagios rewrite in Python.

~~~
scott_karana
Shinken was specifically mentioned in the slides as just a "Nagios", and not
solving the problem.

Not sure whether that's true or not, but they did address it...

~~~
noja
Shinken is built to scale.

------
rjzzleep
has anyone tried icinga or opennms, and can comment on that ?

[https://www.icinga.org/nagios/feature-comparison/](https://www.icinga.org/nagios/feature-comparison/)

~~~
comice
opennms is a bit of a culture shock if you're used to nagios. It works kind of
inversely, in that it wants to auto-discover the servers and services to
monitor itself. Frankly, it feels very much designed around snmp imo (which
I'm not saying is a problem, but it's different to how we use nagios).

It's also the opposite of nagios in that rather than lots of smaller moving
parts, it is one big mega (java) process that does everything. Again, not
necessarily a problem (though I happen to think so :), but different.

I also found opennms to be VERY complicated. I suppose nagios is though, first
time around.

For some reason though, I really want to use opennms and keep going back to
try it out, but eventually give up.

~~~
seiji
OpenNMS is great to run _in addition to_ a traditional monitoring system.

Your traditional monitoring systems have hand-selected features to monitor and
alert for. OpenNMS will just go out and discover everything you have (and
_graph_ everything without any intervention too).

You probably aren't monitoring all the statistics on every interface of your
switches (what? people have switches?), but just throw OpenNMS at your
networking management subnet and it'll pick up everything for later review.

You _can_ use OpenNMS for alerting and inventory tracking, but I prefer more
extensible tools for those. Just use OpenNMS as a largely hands-off sanity
check of your existing monitoring and graphing systems.

------
zimbatm
Sensu is alright but it also has a few downsides:

I'm not a huge fan of having yet another debian package with its own version
of ruby packaged. It does make the plugins easier to write though.

Checks need to be installed on the client (like nagios). It means that some
coordination is necessary when you want to add a new check on the server side.
This is largely resolved when using a configuration management system but it
doesn't seem clean to me. The sensu-community repo has a lot of checks which
is great to get started, some of them need some ruby gem dependencies to work
though.

I had issues with malformed json config or rabbitmq disconnections which would
crash the server. Because the debian package uses the old sysvinit it wasn't
restarting. Moved the init scripts to upstart and added json validation when
generating the config and now it's fine.
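
The "validate before you write" step can be as small as the sketch below. It is not the actual setup described above: the function name and the rule that the top-level value must be a JSON object are assumptions, and the path in the usage note is hypothetical.

```python
import json
import os
import tempfile


def write_validated_json(path: str, text: str) -> None:
    """Write text to path only if it parses as a JSON object.

    The file is written to a temporary name in the same directory and then
    renamed, so a reader never sees a half-written or malformed config.
    """
    parsed = json.loads(text)  # raises json.JSONDecodeError (a ValueError)
    if not isinstance(parsed, dict):
        raise ValueError("top-level JSON value must be an object")

    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(parsed, handle, indent=2, sort_keys=True)
            handle.write("\n")
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```

A config generator would call something like `write_validated_json("conf.d/client.json", rendered)` and refuse to restart the service if it raises, so the previous good file stays in place.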

------
laichzeit0
Nagios is pretty much a joke compared to most enterprise production monitoring
tools, e.g. Wiley and Foglight. I always find it funny to read what people
consider "monitoring": they're talking about a few disparate metrics and then
complain that the alerting/paging sucks.

The tool has to be able to trace a transaction end-to-end, i.e. if a user
visits your page or uses your application you need the ability to trace it
from HTTP to EJB across any webservices and queues and ESBs right down to
which queries were used in the database. If you can't do that you're using a
shit monitoring suite.

Knowing infrastructure metrics is useless without knowing if it's actually
affecting end-users and in which use-cases.

------
nasalgoat
I hear a lot of people crapping all over Nagios, but none of the alternatives
are any better.

I recently had an opportunity to do a clean sheet build out for monitoring, so
I evaluated Zabbix, Munin, and combos of statsd/Graphite, etc. and none of them
were better.

That said, I have a stock Nagios base config that I can install and have
monitoring in five minutes. The key to Nagios configs is to define hostgroups
in one file, and then create config files for each host, assigning it to a
group. Then you put the service definitions in a service file. Easy peasy.
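
As a sketch of that layout (one hostgroup file, one file per host, one service file), here is a small Python generator. It is an illustration, not the stock config mentioned above: the file names, host names and addresses are invented, and the `generic-host` / `generic-service` templates are assumed to be defined elsewhere, as in the Nagios sample configs.

```python
def render_object(kind: str, fields: dict) -> str:
    """Render one Nagios object definition block."""
    width = max(len(key) for key in fields)
    lines = [f"define {kind} {{"]
    for key, value in fields.items():
        lines.append(f"    {key.ljust(width)}  {value}")
    lines.append("}")
    return "\n".join(lines) + "\n"


def build_layout(groups: dict, hosts: dict, services: list) -> dict:
    """Return {relative file path: file contents} for the whole layout."""
    files = {}
    # 1. All hostgroups in one file.
    files["hostgroups.cfg"] = "\n".join(
        render_object("hostgroup", {"hostgroup_name": name, "alias": alias})
        for name, alias in groups.items()
    )
    # 2. One file per host, assigning it to a group.
    for name, (address, group) in hosts.items():
        files[f"hosts/{name}.cfg"] = render_object("host", {
            "use": "generic-host",  # assumed template
            "host_name": name,
            "address": address,
            "hostgroups": group,
        })
    # 3. Services attach to hostgroups, not to individual hosts.
    files["services.cfg"] = "\n".join(
        render_object("service", {
            "use": "generic-service",  # assumed template
            "hostgroup_name": group,
            "service_description": description,
            "check_command": command,
        })
        for group, description, command in services
    )
    return files


if __name__ == "__main__":
    layout = build_layout(
        groups={"web": "Web servers"},
        hosts={"web1": ("10.0.0.11", "web")},
        services=[("web", "HTTP", "check_http")],
    )
    for path, body in layout.items():
        print(f"# {path}\n{body}")
```

Because services hang off the hostgroup, adding a host means dropping in one small file and reloading; nothing in the service file changes.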

------
rbc
One thing that is very strong is the need for a "host" with a Nagios service
definition. This doesn't map so well on to environments like Amazon EC2 auto-
scaling groups. You don't necessarily know the host names in advance. You wind
up building Nagios plugins that can monitor a pool of hosts (using cloudwatch
or whatever) and gives you some kind of aggregate status. It sounds like a
kluge to push it into the plugin, but it does allow you to use the Nagios
alerting, which is pretty well understood.
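
A minimal sketch of such a pool-level plugin is below. It only covers the aggregation step: how the healthy and total counts are obtained (CloudWatch or otherwise) is left out, and the threshold ratios are arbitrary assumptions. The exit codes are the standard Nagios plugin ones.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL",
          UNKNOWN: "UNKNOWN"}


def aggregate_status(healthy: int, total: int,
                     warn_ratio: float = 0.8,
                     crit_ratio: float = 0.5) -> tuple:
    """Collapse a pool of hosts into one (exit code, status line) pair.

    warn_ratio and crit_ratio are assumed defaults: below 80% healthy is
    WARNING, below 50% healthy is CRITICAL.
    """
    if total <= 0:
        return UNKNOWN, "UNKNOWN - pool is empty"
    ratio = healthy / total
    if ratio < crit_ratio:
        code = CRITICAL
    elif ratio < warn_ratio:
        code = WARNING
    else:
        code = OK
    return code, f"{LABELS[code]} - {healthy}/{total} hosts healthy"


def run(healthy: int, total: int) -> int:
    """Print the status line and return the code to exit with."""
    code, message = aggregate_status(healthy, total)
    print(message)
    return code
```

A real plugin would end with `sys.exit(run(healthy, total))` after fetching the counts. Nagios then sees a single service on a single placeholder "host", regardless of how many instances come and go behind it.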

------
js2
Rejoinder -
[https://news.ycombinator.com/item?id=7340514](https://news.ycombinator.com/item?id=7340514)

~~~
gretful
came here to rebut, read your response, went away satisfied with it.

------
codingbeer
Presentations like this always remind me why "DevOps" is very different from
system administration.

------
callesgg
I choose nagios cause there is currently nothing else on the market that is
actually better.

I think nagios is a piece of shit, but it is a working piece of shit.

------
martin_
Do we use nagios to monitor the other 6 utilities? Or what about when the
alerting gateway goes down?

------
rafekett
while we're at it, let's let graphite die too. in a fire.

~~~
zwily
Why? And replace it with what?

~~~
kylek
[http://en.wikipedia.org/wiki/Space_Pen](http://en.wikipedia.org/wiki/Space_Pen)
:)

