
Why Big Monitoring Software Sucks - craigkerstiens
http://obfuscurity.com/2012/06/Why-Big-Monitoring-Software-Sucks
======
suprgeek
It is easy to beat up on big complicated monitoring software - call it
Enterprise Level and then proceed to find flaws. There is a reason it is
complicated - because it tries to solve Tough problems: Can these other
offerings give me:

1) High-availability with minimal downtime (seconds)

2) I18N readiness & L10N language packs

3) Scale on the order of 10000 or 100K

4) Guaranteed delivery of Alerts and Metrics

5) Easy Deployment and configuration - where the first step is NOT download
and deploy Redis and configure it.

Make no mistake, I love open source offerings, but having worked in this domain
for a while, I've seen that Enterprise software becomes complicated in order to
solve complicated problems with minimal intervention by users - not every IT
shop has super-duper DevOps ninjas who Crunch Machine Learning Paradigms for
breakfast.

~~~
gouranga
Excuse the cynical POV here, but it's from a number of years of dealing with
such things. The problems I find with enterprise software as a rule are it
solves the above but fails at:

1) Realistic scaling. Just about every piece of enterprise grade software I've
seen falls off a cliff when it hits a certain load point. That load point is
ALWAYS high enough to sell it to you. When it goes wrong...

2) When it goes wrong, it's a nightmare of maintenance contracts,
verification, finger pointing and telephone calls that last hours.

3) It rarely works as advertised. I mean literally 1% works as advertised.

4) Trials and realistic evaluations without crazy constraints are not usually
possible (this is changing slowly).

5) Installation is a breeze but when it comes to backup/restore and upgrade,
it's a pain in the butt.

Time is money. The difference between "enterprise level software" and "OSS"
platforms is the following:

* OSS time is spent up front.

* Enterprise time is spent later on and costs more up front.

I'd rather have one Ninja on the team and lose the enterprise software. Ninjas
scale better as well.

~~~
edbloom
this.

I do a lot of work in Enterprise CMS land and Open Source land and you've
essentially described my experiences working with ALL the Enterprise CMSs I've
used since the early 00s. I'm pretty convinced that within about 18 months
Open Source CMSs will have negated any residual technical benefits that
Enterprise platforms offer.

At this point the only competitive advantage Enterprise CMS vendors will have
is their reputation and the "nobody ever got fired for buying IBM" attitude
and commercial support.

With many other businesses sprouting up to provide commercial support for open
source platforms, I can see many Enterprise vendors having to pull up their
socks and shake up how they sell and manage their platforms in the very near
future.

~~~
mrdodge
My favorite aspect of this is that enterprise support is often very bad and
unknowledgeable about the product if you try to do something that's the least
bit uncommon or outside their script.

You would think people focused on knowing just one product would know
something about it.

~~~
gouranga
Not at all - they know shit. Unfortunately "enterprise software companies"
hire the lowest bidder in the lowest bidding country and they hire the lowest
bidding staff.

This is especially true of Microsoft. I've had the pleasure of dealing with
their highest level of Gold Partner support over an IE9 bug that broke
ClickOnce entirely.

Basically: absolutely fucking useless, blame the client, wriggle out of having
to do anything.

That was until they met me. 35 phone calls (I shit you not), 3 heated
arguments, spread bad press all over stackoverflow and MS connect, blog whinge
and finally a half arsed registry fix that we had to deploy across 2000
disparate clients!

6 fucking months it took and we're a MS Gold Partner. It cost us more than our
subscription cost in bad rep, time and support costs.

------
josephruscio
Could not agree more. This is the approach we've strived to take while
building <https://metrics.librato.com> ... API access for everything,
integration with popular OSS for metric collection (e.g. statsd, collectd),
loosely coupled integration with complementary tools (e.g. Papertrail,
Pagerduty), etc.
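
As a rough illustration of why statsd-style collection is so easy to integrate with, here is a minimal sketch of emitting a metric in the statsd line format (`<name>:<value>|<type>`) over UDP. This is not Librato's API; the function names and the metric name used below are made up for the example, and the host and port are assumptions about a local statsd daemon.

```python
import socket


def format_statsd(name: str, value, metric_type: str) -> str:
    """Build one statsd line: <name>:<value>|<type>.

    Common types are "c" (counter), "ms" (timer) and "g" (gauge).
    """
    return f"{name}:{value}|{metric_type}"


def send_statsd(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget one metric line over UDP.

    The host is an assumption; 8125 is the port statsd listens on by default.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

Calling `send_statsd(format_statsd("api.requests", 1, "c"))` would increment a hypothetical `api.requests` counter. Because the sender only needs a UDP socket and a one-line text format, the application stays decoupled from whatever stores or graphs the data.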

~~~
obfuscurity_
@josephruscio - I was >this< close to going back and giving a quick mention to
Librato Metrics. You guys definitely "get" what I'm talking about. I like that
you provide well-defined interfaces for easily getting data into and out of
your application. You focus on trending and let other (better?) software
handle the other stuff.

------
sigre
So, what would a "small, sharp" monitoring tool look like that's compelling to
customers? Just an agent, and the customer supplies their own notification and
trend-reporting services?

~~~
sciurus
For example,
<http://www.control-alt-del.org/2012/03/28/collectd-esper-amqp-opentsdbgraphite-oh-my/>

------
ghshephard
To some degree, the best "monitoring software" doesn't even monitor at all -
it simply provides a framework for people to add their monitors to. To take
one example I'm familiar with (though I'm sure this could be applied to the
rest of the great monitoring systems) - Nagios is, at its heart, a
State-Tracking/Notification/Scheduling Engine. The fact that you can add commands
like "ping" or "http test" or "Database Schema Verification" to the tasks that
you are scheduling and tracking the state on, is almost incidental to the core
of what Nagios does.
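
To make the "framework, not monitor" idea concrete, here is a rough sketch of such an engine. It is not how Nagios is implemented; the `Engine` class, its method names, and the two-state model (Nagios itself has more states) are all assumptions made for illustration. The engine knows nothing about ping or HTTP: it only schedules callables, tracks their last state, and notifies on a state change.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

OK, CRITICAL = "OK", "CRITICAL"


@dataclass
class Engine:
    """Schedules checks, tracks their state, and notifies on state changes."""

    # Called as notify(check_name, old_state, new_state).
    notify: Callable[[str, str, str], None]
    checks: Dict[str, Tuple[Callable[[], bool], float]] = field(default_factory=dict)
    states: Dict[str, str] = field(default_factory=dict)
    next_run: Dict[str, float] = field(default_factory=dict)

    def add_check(self, name: str, check: Callable[[], bool], interval: float) -> None:
        """Register any callable that returns True when healthy."""
        self.checks[name] = (check, interval)
        self.next_run[name] = 0.0  # due on the first pass

    def run_due(self, now: float) -> None:
        """Run every check whose interval has elapsed at time `now`."""
        for name, (check, interval) in self.checks.items():
            if now < self.next_run[name]:
                continue  # scheduling: not due yet
            try:
                new_state = OK if check() else CRITICAL
            except Exception:
                new_state = CRITICAL  # a crashing check counts as a failure
            old_state = self.states.get(name, OK)
            if new_state != old_state:
                self.notify(name, old_state, new_state)  # notification
            self.states[name] = new_state  # state tracking
            self.next_run[name] = now + interval
```

A "ping" or "http test" is then just another callable passed to `add_check`, and the notifier can be anything from `print` to a pager integration, which is the sense in which the checks are incidental to the core.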

~~~
spudlyo
I agree completely, and I think Nagios does a good job at state tracking,
notification, and scheduling. The interface is a bit clunky, but with the
nuvola makeover it's completely serviceable. Pair it with something like
ganglia or cacti for visualization and you're most of the way there.

~~~
LaSombra
Take a look at Opsview. It runs on top of Nagios 3, saves you from editing
those pesky configuration files, and integrates with MRTG. Pretty neat.

------
beedogs
I swear I've seen HP OpenView processes using more CPU than Oracle. I have no
idea why companies throw away so much money on such terrible software.

~~~
spudlyo
Monitoring systems that have to execute thousands of active checks every
polling interval consume a bit of CPU. I've certainly seen high CPU usage and
run queue depth on a busy Nagios server.

~~~
beedogs
On a central server, I can understand. These processes were running on
clients, though, and consuming as much memory and CPU as they could find.

------
antoncohen
Sounds like Sensu (<https://github.com/sensu/sensu>) fits the requirements.
It's a small monitoring framework that works in conjunction with Chef or
Puppet, PagerDuty, Librato, Graphite, and others.

~~~
sciurus
Sensu came up several times in the recent monitoring discussion on
devops-toolchain [0]. I'm going to try a setup with sensu, graphite, and
collectd soon, like the one Sean Escriva described at ChefConf [1].

[0] <https://groups.google.com/group/devops-toolchain/browse_thread/thread/e5dac7a03dfb4fbf>

[1] <http://www.youtube.com/watch?v=BXxtdE-Paco>

------
tpsreport
Nice article, whose central message ("don't bundle your tools") is widely
applicable to domains other than monitoring. Combining small, specialized
tools is the very essence of the Unix philosophy.

~~~
dasil003
The problem is particularly acute in monitoring because you tend to need to
monitor many disparate systems and subsystems in a very fine-grained manner.
It needs to be reliable and not impact performance no matter where it is run,
which usually has to be everywhere. The Unix philosophy is definitely useful
on a wider basis, but monitoring could be the poster child.

------
shortlived
Yeah but does it scale?

Edit: I should clarify, I really liked the article and I like the unix
philosophy but I'm truly interested to know what type of networks can be
monitored with this approach. I work in the enterprise monitoring space and a
lot of the solutions _are_ truly horrible, but on the other hand, I'm not sure
this approach would work either when you have 500K managed nodes and are
dealing with multiple campuses across 3 continents...

~~~
obfuscurity_
Yes, it certainly /can/ scale. In fact, a properly modularized monitoring
solution /should/ be more capable of scaling if it was designed with this in
mind. This is certainly the approach we've taken with our monitoring and
trending components at Heroku. We don't have anywhere near 500K nodes, but the
principles scale.

