

Ask HN: What do you wish monit/cacti/nagios/zabbix/etc would do? - Skywing

My co-workers and I have been writing some software to monitor the activity and health of our servers at work. We found that many of the available solutions were either painful to deal with and set up, or expensive. We were hoping to find some sort of agent-based hosted solution similar to most of the current web analytics services. We couldn't find one we liked, so we're creating one. We're at the point where we're just pre-processing the incoming data that we're tracking.

What type of info and data do you all think we could all use for server analytics/metrics?
======
bhuga
I've thought about creating this about 5 times; the existing services really
haven't gotten it quite right, IMO. For me, exactly which metrics to measure
matters less than the poor correlation tools. The usual suspects are fine for
'easy', and the non-usual suspects will always require custom scripts.

More important for me is letting go of 'server stats'. The existing tools you
mention are stuck in pre-cloud days, assuming that I care about a graph for a
particular server as opposed to the average/95th/99th percentile across a group
of servers filling a role.

The best solution I've found is OpenTSDB, which lets me tag any data point, so
I can tag web server CPU use metrics as 'web' and not worry about machines
going up and down when I'm looking at web server CPU use. But OpenTSDB is a
pain to set up, and if someone jazzed up the graphs and made it a service, I'd
buy it.
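
To illustrate the tag-based view described above, here is a minimal Python
sketch that groups data points by a tag and reports the average, 95th and 99th
percentile per group. It is not OpenTSDB's API; the function names, the
nearest-rank percentile method, and the sample hosts and values are all
assumptions made for the example.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_by_tag(points, tag):
    """Group (tags, value) points by one tag and summarize each group."""
    groups = defaultdict(list)
    for tags, value in points:
        if tag in tags:
            groups[tags[tag]].append(value)
    return {
        name: {
            "avg": sum(vals) / len(vals),
            "p95": percentile(vals, 95),
            "p99": percentile(vals, 99),
        }
        for name, vals in groups.items()
    }


# Hypothetical CPU samples; host names, roles and values are made up.
points = [
    ({"host": "web1", "role": "web"}, 20.0),
    ({"host": "web2", "role": "web"}, 40.0),
    ({"host": "web3", "role": "web"}, 90.0),
    ({"host": "db1", "role": "db"}, 75.0),
]
summary = summarize_by_tag(points, "role")
print(summary["web"])
```

Because the grouping key is the tag rather than the host name, a machine that
joins or leaves the 'web' role simply adds or removes samples from the group.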

~~~
Skywing
Correlation is also why we started this. A little story: we were using Cacti
on a cluster of ours. I was showing a friend of mine the graphs, only because
I was explaining our infrastructure setup. I noticed one had unusually high
CPU load. The server with the high load was one of our SQL database servers. I
SSH'd into it and, sure enough, MySQL was causing the load. But ... why? After
some log file browsing, we found that our site was being scraped by some bots,
and the particular page being scraped was doing a sorted query on a small
range of data. We think that caused the entire table to be sorted and then
scanned, which drove up CPU usage. So the correlation between the high CPU
usage and its cause was indirect, and a simple graph was only enough to show
me that there was a problem on that machine. Plus, I wasn't even alerted - I
stumbled upon it.

So, this is something we're aiming to fix.

~~~
bhuga
If you can correlate semantically-parsed log data with time series to
automatically understand that particular problem, you're solving problems
harder than server monitoring and applying them to server monitoring, IMO :)
Not that there's anything wrong with that!

------
garyrichardson
I feel that Nagios-style system monitoring is done. Writing Nagios plugins is
fairly easy, and tools like Opsview give an API for adding servers. Monitoring
disk space, memory, CPU, processes, etc works great in any of the above tools.

The problem with them is that reporting data tends to sit in RRD silos and
doesn't work well at the application layer.

I store metric data in Graphite from my application. If there is something I
want to monitor for performance purposes in my application, I put markers
around it and this information goes into Graphite. This can be graphed, or CSV
data can be pulled.
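
One way to picture the "markers" described above is a timing wrapper that
emits a line in Graphite's plaintext protocol (metric path, value, Unix
timestamp, newline). This is only a sketch: the metric name, the `timed`
helper and the `localhost` Carbon host are assumptions, and the demo collects
lines in a list instead of opening a socket.

```python
import socket
import time
from contextlib import contextmanager


def format_metric(path, value, timestamp):
    """Build one plaintext-protocol line: path, value, Unix timestamp, newline."""
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(line, host="localhost", port=2003):
    """Send one line to a Carbon listener. The host is an assumption;
    2003 is Carbon's default plaintext port."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))


@contextmanager
def timed(path, emit=send_metric):
    """Time the enclosed block and emit its duration in milliseconds."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        emit(format_metric(path, round(elapsed_ms, 3), time.time()))


# Demo without a network: collect the emitted lines in a list.
sent = []
with timed("app.checkout.render_ms", emit=sent.append):  # hypothetical metric name
    time.sleep(0.01)  # stand-in for the code being measured
print(sent[0], end="")
```

In a real application the default `emit=send_metric` would be used, so each
marked block produces one data point that can later be graphed or exported.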

I want a tool that can monitor values stored in Graphite and notify me if
something is out of range. I've been thinking about building a Nagios plugin
that does that, I just haven't gotten around to it yet.
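
A rough sketch of what such a plugin could look like in Python follows. It
assumes Graphite's render API is queried with `format=json`, that higher
values are worse, and that the latest non-null data point is the one to
check. The function names, URL and thresholds are hypothetical; only the exit
codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the Nagios plugin
convention.

```python
import json
import urllib.parse
import urllib.request

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios plugin exit codes


def fetch_datapoints(base_url, target, window="-5min"):
    """Query Graphite's render API as JSON and return [value, timestamp] pairs."""
    query = urllib.parse.urlencode(
        {"target": target, "from": window, "format": "json"}
    )
    with urllib.request.urlopen(f"{base_url}/render?{query}", timeout=10) as resp:
        series = json.load(resp)
    return series[0]["datapoints"] if series else []


def evaluate(datapoints, warn, crit):
    """Map the most recent non-null value to a Nagios status and message."""
    values = [value for value, _ts in datapoints if value is not None]
    if not values:
        return UNKNOWN, "UNKNOWN - no data in window"
    latest = values[-1]
    if latest >= crit:
        return CRITICAL, f"CRITICAL - value {latest} >= {crit}"
    if latest >= warn:
        return WARNING, f"WARNING - value {latest} >= {warn}"
    return OK, f"OK - value {latest}"


def main(argv):
    """argv: base_url target warn crit. Prints one status line, returns the exit code."""
    base_url, target, warn, crit = argv[0], argv[1], float(argv[2]), float(argv[3])
    try:
        status, message = evaluate(fetch_datapoints(base_url, target), warn, crit)
    except Exception as exc:  # network or parse failure
        status, message = UNKNOWN, f"UNKNOWN - {exc}"
    print(message)
    return status
```

To run it as a plugin, the script would end with
`sys.exit(main(sys.argv[1:]))`, so that Nagios reads the status from the exit
code and the first line of output.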

------
slysf
An absolutely necessary feature for me is auto-discovery. Zabbix really has
this down in recent versions. If your system has comparable templating/groups
and the ability to auto-discover hosts in IP ranges, you'd be one up on most
solutions. If you then added better dashboards (à la Zenoss?) and the ability
to acknowledge events from emails, etc., then you'd really have something I'd
try out.

Here's the bar for me: if I spend a few solid days on Zabbix per "server
class" (i.e., MySQL, web server, etc.), I can have everything I need
automatically monitored with zero effort going forward (up to ~1000 servers).
That said, acknowledgement of alerts, etc. is a huge pain in Zabbix.

~~~
Skywing
Alerts, in various forms, are definitely one of our top priorities. That was
the spark that ignited our project. So, we're focused on making those as
intuitive as possible.

As for auto-discovery, we have a rudimentary form of this in place. Deploying
the software agent is manual, but it's as easy as running it on a new server.
The data reporting and metrics then begin automatically.

------
Skywing
Just looking for some feedback on the initial concept. Thanks.

------
oomkiller
Scout is actually pretty good, not sure if you saw it or not.

~~~
Skywing
I hadn't seen Scout yet. It's good to see a similar service. Having not seen
them until now, I can say that we had planned to provide slightly more for
less than they do, in terms of data logging and analytics. It's neat to see
them, because the things we have come up with are closely in line with theirs,
and I think we can genuinely compete.

~~~
bhuga
I looked into purchasing this exact kind of service a few months ago. I admit
that it was tough to find them (they are often not on the front page of
Google), but there were 3 or 4 players in addition to various installable
tools (sorry, names now forgotten). None really seemed to be what I wanted,
and they were not particularly cheap. I didn't buy any, so that's a customer
looking to buy in the space and not finding a solution; by all means give it a
go. But you're missing some homework if you had not yet found Scout, which is
front and center in this space. One of them even had a whizbang way to edit
and test your little agent plugins in the browser in Ruby (was that Scout?).
You have some mature competition here you should consider.

~~~
Skywing
Thanks for the response. We really started custom building our own tool, in
house, to do exactly what we wanted. We manage about 150 physical servers, so
we can't benefit from the underlying hypervisor info that some VM platforms
provide. We have been writing our own, and subsequently feel like we could
offer this at a very affordable price. We hadn't seen Scout, mostly because we
just dove right into making our own tool. We're now doing some research and
are finding that we're right on track in terms of being in line with
functionality that others provide - like Scout. We're not tied to Ruby,
though. I don't even know if our Linux servers come with Ruby - never used it.
Admittedly, ours is a little less mature, which is not a bad thing since we
are so new, but we're going to continue following our gut on this and just
make it as useful a tool as possible.

~~~
bhuga
None of the players are tied to Ruby, but the agents are often written in it
(not ideal, as the usual 35MB memory footprint adds up over many servers).

In any event, don't take my call for research as a call not to join the party.
The fact that you haven't heard of them means that none of them have quite
solved the product/market fit stage and made it to ubiquity. Being intimidated
or imitating them would not be the way to go, but knowing about them would
probably be useful, and they have some good ideas.

(Alternatively, it means that most people prefer to do this in-house, and
there's just not a market there. But I don't believe that; this is such boring
stuff that happens on every project...)

Anyway, good night and good luck.

