

Ask HN: large system monitoring - bigsystemadmin

I'm an admin at a fairly large company with plenty of database servers, virtual machines and all kinds of custom distributed software that processes a lot of data. I'm interested to hear what others are using for monitoring large systems. What kind of system did you put in place? How do you process logs, events, hardware and software failures, etc.? Do you use a proprietary system or something you have built yourself?
======
pj
When I worked as a web developer at a Fortune 500 company, we had this same
sort of issue. Web sites would go down and we'd get a call from someone trying
to use them. This was 10 years ago...

To learn about and fix these problems before the customer noticed, some
interns built a scheduled web site poller that would check each response for
error codes and timeouts, push the status to a database and send an email to
subscribers.

They built it in just a few weeks, knowing very little about programming, and
it reduced our customer calls dramatically. We were now aware of a problem and
could resolve it before customers complained.

Anyway, that kind of thing is very easy to build, so just go do it. Monitor at
the application level for failures; most errors at the infrastructure level
bubble up anyway, so you can catch them at the end point.
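
For illustration, the check at the heart of such a poller might look roughly like the Python sketch below. It is not the interns' code: the names (`CheckResult`, `fetch_status`, `check_site`, `failures`) are hypothetical, only the standard library is used, and the scheduling, database write and email steps are left out.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Outcome of one poll (hypothetical structure, for illustration)."""
    url: str
    ok: bool
    detail: str


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for url; timeouts and network errors raise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib reports 4xx/5xx responses as exceptions; keep the code.
        return err.code


def check_site(url, fetch=fetch_status):
    """Poll one URL; a timeout or an error status counts as a failure."""
    try:
        status = fetch(url)
    except Exception as err:  # timeout, DNS failure, refused connection, ...
        return CheckResult(url, False, f"no response: {err}")
    if status >= 400:
        return CheckResult(url, False, f"HTTP {status}")
    return CheckResult(url, True, f"HTTP {status}")


def failures(urls, fetch=fetch_status):
    """Return only the failed checks, i.e. what subscribers would be told about."""
    results = [check_site(url, fetch) for url in urls]
    return [r for r in results if not r.ok]
```

Run from cron or a similar scheduler, `failures([...])` gives the list to store and mail out. The `fetch` parameter exists only so the check can be exercised without touching the network.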

Now, if you want to do scalability projections or monitoring of non-web apps,
this won't help...

------
patrickg-zill
What kind of monitoring are you looking for?

Generally what is out there is either

a) long term graphing and trend analysis (that lets you determine that you
will need more disk space in 3 months) or

b) watching for a service to go down, in which case you send an email or SMS.

~~~
bigsystemadmin
I am after both of these, but I also need a way to aggregate a lot of log
files and perform actions like "If there was no log entry X in the last N days
send an email" or "If system A wrote N entries to database and system B did
not process any in time T perform some action or raise an alarm". I would
guess I need some kind of event aggregation, with scripts run based on what
happened. All the systems could be modified to log to an additional place if
needed.
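
As a rough sketch of those two rules in Python, assuming the aggregated logs have already been parsed into timestamped events: the `Event` type, the function names and the event names in the usage note are hypothetical, not taken from any particular tool.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Event(NamedTuple):
    """One aggregated log entry (hypothetical shape)."""
    when: datetime
    source: str
    name: str


def missing_event(events: List[Event], name: str,
                  window: timedelta, now: datetime) -> bool:
    """True if no event called `name` occurred within `window` before `now`."""
    cutoff = now - window
    return not any(e.name == name and cutoff <= e.when <= now for e in events)


def stalled_consumer(events: List[Event], produced: str, consumed: str,
                     min_produced: int, window: timedelta,
                     now: datetime) -> bool:
    """True if at least `min_produced` `produced` events were seen in the
    window but not a single `consumed` event."""
    cutoff = now - window
    recent = [e for e in events if cutoff <= e.when <= now]
    n_produced = sum(1 for e in recent if e.name == produced)
    n_consumed = sum(1 for e in recent if e.name == consumed)
    return n_produced >= min_produced and n_consumed == 0
```

A scheduled job could call, say, `missing_event(events, "backup_ok", timedelta(days=3), datetime.now())` and send the email or run the script whenever a rule returns `True`.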

~~~
rufo
I can't say I've used it, but this sort of sounds similar to what splunk
(<http://www.splunk.com/>) does.

------
cjwaters
We have lots of people who use Paglo to monitor data center infrastructures.
In fact we use Paglo extensively to monitor itself. The best thing is that it
is easy to feed custom data into Paglo and trend that alongside other
information about the infrastructure. For example, I watch the size of the
queues in our messaging system together with the load averages, memory and
bandwidth usage of key servers.

We also use monit to monitor and restart processes under Linux. It is great
for handling a couple of services which leak memory and need to be restarted
periodically.

------
bayareaguy
Ganglia <http://ganglia.info> is pretty easy to set up and is good for
charting simple machine statistics.

------
gsiener
Would love to hear as well, as I'm thinking about how to do the same on a much
smaller scale. Nagios/SNMP seems like a good infrastructure to build on.

~~~
gaius
Nagios is OK for "is X > Y?" type alerting, but it's very poor at "has X
changed since T?" and forget "is X greater than it was 24 hours ago AND
greater than it was 1 week ago and is Y as well, then page me".
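
To make that last kind of rule concrete, here is a minimal Python sketch. It assumes each metric is stored as a time-sorted list of `(datetime, value)` samples; the function names are hypothetical and this is not how Nagios or any other tool implements it.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def value_at(samples, when):
    """Most recent value recorded at or before `when`, or None if there is none.

    `samples` is assumed to be a list of (datetime, float) sorted by time.
    """
    times = [t for t, _ in samples]
    i = bisect_right(times, when)
    return samples[i - 1][1] if i else None


def rising(samples, now,
           lookbacks=(timedelta(hours=24), timedelta(weeks=1))):
    """True if the current value exceeds the value at every lookback point."""
    current = value_at(samples, now)
    if current is None:
        return False
    for back in lookbacks:
        past = value_at(samples, now - back)
        if past is None or current <= past:
            return False  # no history, or not greater than it was back then
    return True


def should_page(x_samples, y_samples, now):
    """Page only if both X and Y are above their 24-hour and 1-week-old values."""
    return rising(x_samples, now) and rising(y_samples, now)
```

The point of the sketch is that the rule needs stored history for each metric, which a plain threshold check against the current value does not.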

~~~
nailer
Apparently it also has scalability problems with very large networks - ask the
Facebook Site Reliability team. They have 20,000 servers and tried repeatedly
to make Nagios work before writing their own tools.

------
damir
monit & munin are of great help to me.

~~~
mmmurf
Monit has been very useful to me as well.

I also use capistrano's "cap shell" command to easily run the same command on
all the servers.

~~~
Deadsunrise
I use monit and munin with capistrano tasks to install, update and configure
them. Right now I'm monitoring a couple hundred containers and hosts at work,
mostly mail servers.

------
axod
Cacti is pretty useful for graphing; it just polls snmpd on each server, and
can do cool custom stuff.

