

Ask YC: What tools are you using to monitor a site's load? - darius


======
jbyers
Consider collectd (<http://collectd.org/>). Unlike a lot of the usual
suspects, collectd is a daemon that records the usual server health stats
every 10 seconds into rrd files. After running it for two years on a few dozen
systems, it's never failed or caused undue load on its own. We often see
events that would have gone completely unnoticed in a 5-minute monitoring
window.
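
To illustrate the point about sampling windows, here is a minimal Python sketch. The numbers are made up for illustration (a hypothetical 30-second burst at 95% on an otherwise idle box); nothing here is collectd output or collectd's API. It only shows how a burst that is obvious in 10-second samples mostly disappears once the same data is averaged over 5 minutes.

```python
# Hypothetical data: 5 minutes of CPU utilisation sampled every 10 seconds.
# All values are assumptions chosen for illustration, not real measurements.
SAMPLE_INTERVAL_S = 10
WINDOW_S = 300


def build_samples():
    """Return 30 samples: a baseline of 10% with a 30-second burst at 95%."""
    samples = [10.0] * (WINDOW_S // SAMPLE_INTERVAL_S)
    samples[12:15] = [95.0, 95.0, 95.0]  # three consecutive 10-second samples
    return samples


def window_average(samples):
    """What a single 5-minute averaged data point would report."""
    return sum(samples) / len(samples)


def peak(samples):
    """What the 10-second series lets you see."""
    return max(samples)


if __name__ == "__main__":
    s = build_samples()
    print(f"10-second peak:   {peak(s):.1f}%")
    print(f"5-minute average: {window_average(s):.1f}%")
```

With these assumed values the 5-minute figure comes out well under any sensible alert threshold, even though the box was nearly saturated for half a minute.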

------
dazzawazza
I've just started testing nagios <http://www.nagios.org/>.

It looks like complete overkill for a single server and I don't know how
useful it is but it does look promising. It certainly looks like it would
scale to tens if not hundreds of machines easily.

It includes an alert structure so different events trigger different actions.
For example, if the database stops responding, email the DBA; if it's a router,
email the network admin; and so on.
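
As a rough sketch of that routing idea in Python: this is not Nagios configuration syntax, and the component names and addresses are placeholders I made up.

```python
# Hypothetical routing table: which contact hears about which kind of failure.
ROUTES = {
    "database": "dba@example.com",
    "router": "netadmin@example.com",
}
DEFAULT_CONTACT = "oncall@example.com"  # assumed fallback, not from the thread


def contact_for(component):
    """Return the address to notify when `component` stops responding."""
    return ROUTES.get(component, DEFAULT_CONTACT)


if __name__ == "__main__":
    for failed in ("database", "router", "mail"):
        print(f"{failed} is down -> notify {contact_for(failed)}")
```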

Again I can't vouch for it over the long term as it's only been a week or so
of testing but I can't complain atm.

It's a PITA field to research and I'm trying to avoid the 'roll my own' urges
as I'd quite like to write it ;)

Anyone else got any ideas? There is a Python-based monitoring application out
there somewhere that I stumbled upon about 6 months ago with a great plugin
API and neato graphs but I can't find it again :( I blame Google and not my
incompetence :)

~~~
apathy
Nagios is excellent once you get comfortable with it. People have written
check scripts for all sorts of bizarre hardware, interfaces (RS-232 polling,
X10, etc.), and so forth. You can turn the babysitting proactive by writing
event scripts: e.g., when a MySQL slave goes out of sync or memcached wedges,
the event handler notices, kicks it, and if it doesn't recover after a few
tries, _THEN_ you get a page. Obviously you need to be careful about what you
put in an event handler, but used judiciously, they're great.
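
The "kick it, then page" logic can be sketched in a few lines of Python. This is only the control flow, under my own assumptions: it is not Nagios's event handler interface, and `is_healthy`, `restart`, and `page` are hypothetical callbacks you would supply yourself.

```python
from typing import Callable


def handle_failure(is_healthy: Callable[[], bool],
                   restart: Callable[[], None],
                   page: Callable[[str], None],
                   max_attempts: int = 3) -> bool:
    """Try to restart a failed service; page a human only if that fails.

    Returns True if the service recovered, False if someone was paged.
    """
    for _ in range(max_attempts):
        restart()
        if is_healthy():
            return True  # recovered on its own, nobody gets woken up
    page(f"service still down after {max_attempts} restart attempts")
    return False
```

Keeping the restart bounded by `max_attempts` is the "be careful" part: a handler that retries forever hides a real outage instead of escalating it.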

Cacti, ganglia, and so forth are useful adjuncts for routers, clusters, and
the like. But Nagios is a time-tested warhorse with a lot of community
support. It's ugly but it works and works well.

PXE + cfengine + nagios can be pulled together for Real Ultimate Power, or at
least a simulacrum of what goes on at places like Google (whose scripts are
custom Python monstrosities for the most part, but the functionality is pretty
similar, with several levels and types of babysitter systems funneling into a
more proactive statistical resource allocation type of analysis framework
since the scale is so vast; at least that's the track that development was on
when I left, and Urs is still in charge so I doubt it's swerved much).

See here for an article on adaptive state monitoring with Nagios and Cfengine:

<http://www.onlamp.com/pub/a/onlamp/2006/05/25/self-healing-networks.html>

Add something like PXE reimaging of dead nodes to the mix and you cut down
workloads by an order of magnitude on large installations.

If you are a sysadmin and haven't experimented with self-healing systems, you
should fix that gap in your skill set. If you have a sysadmin who can't or
won't, fire him.

~~~
jbyers
If you're starting from scratch, consider puppet
(<http://www.reductivelabs.com/projects/puppet/>) as a cfengine replacement.
cfengine is solid, but unless you've already mastered it, I think you'll find
puppet is at least as capable and a lot more intuitive.

------
8plot
I like munin better than nagios: <http://munin.projects.linpro.no/>

------
staunch
Cacti with SNMP is a very good single-app solution for keeping an eye on server
health trends.

<http://cacti.net/screenshots.php>

Smokeping is quite useful for monitoring network and basic http request
response times.

<http://oss.oetiker.ch/smokeping/>

Nagios is good for actively monitoring services and alerting you when things
go wrong.

I generally use those and create tons of additional monitoring tools that
generate reports/charts.

------
patrickg-zill
Are you talking about site load or server load?

If you want to measure the load on a server, look into the sar utility, which
is available for Linux and most other Unixes. It will take snapshots of data at
10-minute intervals and store them. Other tools will take that data and turn
it into a bar graph or other visualization.
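
As a rough idea of the "turn it into a bar graph" step, here is a small Python sketch. It does not parse real sar output; the `(time, percent busy)` rows below are invented sample values standing in for whatever you extract from the stored snapshots.

```python
# Hypothetical 10-minute snapshots: (time label, CPU busy percent).
SAMPLE_ROWS = [
    ("00:10", 12.0),
    ("00:20", 48.0),
    ("00:30", 91.0),
]


def text_bar_chart(rows, width=40):
    """Render (label, percent) rows as fixed-width text bars."""
    lines = []
    for label, percent in rows:
        clamped = max(0.0, min(100.0, percent))  # keep bars inside the frame
        bar = "#" * round(clamped / 100 * width)
        lines.append(f"{label} |{bar:<{width}}| {clamped:5.1f}%")
    return lines


if __name__ == "__main__":
    print("\n".join(text_bar_chart(SAMPLE_ROWS)))
```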

------
reitzensteinm
I'd like to know this too.

I suspect Rock Solid Arcade is, at many points, outgrowing my cacheless,
hit-the-database-on-every-page setup. I started out with a cron job monitoring
top every minute, and many times the usage hits 50% - I'm using Django, so the
GIL means that's at full capacity, but I have no idea how instantaneous that
figure is.

I got some helpful advice on the Django chat room to monitor Apache, but
really what I'd like is for some warning to be dumped to a log if a connection
queue backs up for more than a second (i.e. the server is at full capacity).
That'll be time to cache/upgrade. :)
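
The "warn if it stays backed up for more than a second" part could look something like the Python sketch below. It assumes you already have timestamped queue-length samples from somewhere; how to get those out of Apache is not covered here, and the function name and sample format are my own invention.

```python
import logging

log = logging.getLogger("capacity")


def find_sustained_backlog(samples, max_backlog_s=1.0):
    """Find spans where the queue stayed non-empty for too long.

    `samples` is an iterable of (timestamp_seconds, queue_length) pairs in
    time order. Returns a list of (start, end) timestamps for every span in
    which queue_length > 0 continuously for longer than `max_backlog_s`,
    and logs a warning for each one.
    """
    spans = []
    start = last = None
    for ts, queue_length in samples:
        if queue_length > 0:
            if start is None:
                start = ts  # backlog begins
            last = ts
        else:
            if start is not None and last - start > max_backlog_s:
                spans.append((start, last))
            start = None  # queue drained, reset
    # Handle a backlog that is still open when the samples run out.
    if start is not None and last - start > max_backlog_s:
        spans.append((start, last))
    for s, e in spans:
        log.warning("connection queue backed up for %.1fs (from t=%.1f)", e - s, s)
    return spans
```

Short blips are ignored on purpose; only a backlog that outlasts the threshold ends up in the log.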

------
lemuel
Nagios scales, and is easy. You can also write scripts so that if your
daemon/application dies, it will be restarted.

------
davidw
I used nagios at the last place I worked with a number of servers, and I was
pretty happy with it, although I could very easily imagine something better.

------
darius
Thanks everybody. I appreciate all the good links. I'll go experiment a bit
with them.

------
brooksbp
Google Analytics

I believe using anything relatively more intensive is not worth the cycles.
Even 'top' is taxing at times.

