

Ask HN: What do you use to monitor your web application? - boundlessdreamz

What do you use to monitor:

1. Requests/second
2. Average request time
3. Exceptions
4. Background queues
5. MySQL/Postgres servers
6. Uptime
7. Memory usage
8. Disk usage

If you have any tips/scripts/tutorials, please share :)
======
dmytton
These are all important metrics but they're available from different things.

You will have application specific metrics - errors, user counts, message
queue sizes. These will help you understand what is happening in the app
itself, let you fix problems and ensure that things are running smoothly. For
this we use FogBugz to report errors and have various user metrics stored in a
database.

The next level is the daemons and services used by the application. Apache
will report the number of req/s and MySQL can tell you about the number of
connections, or the status of replication. These are important for planning
growth and ensuring your server capacity is sufficient.

Then there's the servers themselves. This is CPU load, memory usage, disk
usage, etc. This ties closely to the services above.

There are a lot of tools for doing these things, but in the end you have to
tie everything together yourself. You might start with simple website response
monitoring with something like <http://www.pingdom.com>, then move on to
having it trigger your own check scripts to alert you when custom stuff goes
wrong (e.g. queue sizes).

And of course I'd recommend my company's server monitoring product
<http://www.serverdensity.com> ;) but you could spend some time setting up a
bigger, more flexible (and thus more complex) tool such as Nagios to pull all
the metrics in.
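Those custom check scripts are easiest to wire into Nagios (or a cron-driven alerter) if they follow the standard Nagios plugin convention: exit 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN and print a one-line status. A minimal sketch for a hypothetical queue-size check (the thresholds and the queue source are invented for illustration):

```python
import sys

def status(value, warn, crit):
    """Map a measured value to a Nagios plugin exit code and label."""
    if value >= crit:
        return 2, "CRITICAL"
    if value >= warn:
        return 1, "WARNING"
    return 0, "OK"

def main(queue_size, warn=100, crit=500):
    code, label = status(queue_size, warn, crit)
    # Nagios reads the first line of stdout plus the exit code.
    print("QUEUE %s - %d jobs queued" % (label, queue_size))
    sys.exit(code)
```

The same script can double as a standalone alerter: run it from cron and mail yourself the output on a non-zero exit.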

------
dpritchett
I have trouble imagining something more useful than Munin for a *nix server.
Check out this screenshot - it only shows four of the tens of graphs available
at any given time: <http://dpritchett.posterous.com/test-driving-munin-service-statistics-on-a-la>

Edit: Munin profiles services and responsiveness. You'll probably want to
complement it with Nagios to monitor service/hardware failures.

~~~
aw3c2
I use Munin even on my local work computer.

And vnstat for realtime and aggregated network traffic. Conky for volatile
realtime cpu/ram/etc.

------
dangrossman
A Windows 7/Vista desktop gadget. Actually, the gadget is just a tiny
(400x300ish) iframe showing a webpage with some small text. It sits in the
corner of one of my monitors where I can always see it.

This webpage is a list of servers (front and back end) and their load
averages.

At least for all the webapps I run, which serve hundreds of requests per
second to tens of thousands of users, load average is a good enough proxy for
all the metrics that matter.

When CPU usage on the front end servers is getting high enough that it impacts
response times and throughput, that shows up in the increasing load averages.
When disk IO on the database servers is getting high enough that it impacts
response times and throughput, that shows up in the increasing load averages.

The web servers each have a webpage whose only output is the load average. For
the database servers, I wrote a little script that queries their load
remotely. The webpage I use as the desktop gadget polls those pages/scripts
and prints out each server's name and current load.
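A load-average page like that can be close to trivial. A sketch, assuming Linux and glossing over how the page is actually served (the author's real script isn't shown):

```python
import os

def loadavg_page():
    """Return the 1/5/15-minute load averages as a one-line page body.

    os.getloadavg() reads the same numbers exposed in /proc/loadavg.
    """
    one, five, fifteen = os.getloadavg()
    return "%.2f %.2f %.2f" % (one, five, fifteen)
```

Drop the return value into any URL the gadget can poll, one per server.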

As for uptime monitoring when I'm not around, I used to use AreMySitesUp.com.
Nice, simple service that does what it's supposed to. Even with a free plan it
seems to check every 15 minutes or more often, and if a site goes down, it
checks again much more often until it comes back up. It'll also tell you the
exact status code if it's not in the 200-300 range, so you can tell a server
is 500'ing out but online, and will notify you of timeouts.

I built my own uptime monitoring service, though, so I don't need that
anymore. I run it using VPS accounts at different providers so I don't get
stuck with the sites and the service monitoring them going down at the same
time. Twilio just added SMS to their API, so adding SMS alerts to any custom
monitoring scripts is dead easy now.
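A bare-bones version of that kind of uptime check might look like this (a hypothetical sketch; the alert step is a stub where a Twilio SMS API call would go):

```python
import urllib.error
import urllib.request

def is_up(status_code):
    """AreMySitesUp-style rule: 200-399 counts as up."""
    return status_code is not None and 200 <= status_code < 400

def check(url, timeout=10):
    """Return (up, status_code); status_code is None on timeout/DNS failure."""
    try:
        code = urllib.request.urlopen(url, timeout=timeout).getcode()
        return is_up(code), code
    except urllib.error.HTTPError as e:
        return is_up(e.code), e.code  # e.g. 500 means online but erroring
    except Exception:
        return False, None

def alert(url, code):
    # Stub: send an SMS via the Twilio REST API here.
    print("DOWN: %s (status %s)" % (url, code))
```

Run it from cron on a couple of cheap VPSes at different providers, as described above, and the monitor and the monitored sites won't share a failure domain.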

------
dryicerx
If you're addicted to every statistic possible, check out collectd
<http://collectd.org/> . It's a tiny daemon that polls pretty much every
server and service stat imaginable every couple of seconds (cpu, mem, disk,
SNMP from routers, values over HTTP from your app, Apache, *sql, etc) without
really any noticeable performance hit. You can output it all to RRD files and
view it in a variety of ways. You can also use Cacti or Munin by themselves to
collect and graph, or use collectd to collect and either of those to graph.
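For a flavor of the setup, a minimal collectd.conf fragment (plugin choice and paths are illustrative; defaults vary by distribution):

```
# Sample every 10 seconds.
Interval 10

LoadPlugin cpu
LoadPlugin memory
LoadPlugin disk
LoadPlugin rrdtool

# Write one RRD file per metric; point your graphing tool at these.
<Plugin rrdtool>
  DataDir "/var/lib/collectd/rrd"
</Plugin>
```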

------
sunkencity
Nagios - checking that the application and database are up, and that disk and
load are OK (also ping, but that is redundant).

I use exception notifier for rails. Simple, but it works. Might switch to
newrelic.

From time to time I also use pingdom.com to see that the site is available
from different networks around the globe.

------
chuhnk
1\. I don't actively use Apache's extended status or req/sec monitoring
because it slows down serving requests. Instead I look in the logs every
minute and feed the count into Ganglia.

2\. Again, Apache logs into Ganglia. New Relic for Rails. Firebug and YSlow
when testing.

3\. Exception notifier in rails, ErrorDocument in apache that triggers a php
script which notifies by email. Log4j in java.

4\. Redis, OpenMQ for queueing, Resque for background jobs

5\. MySQL only because at the time postgres did not support replication.

6\. AlertSite

7\. Monit

8\. Monit

I am always on the look out for better ways to monitor the servers at work. I
never really liked nagios. Monit is extremely easy to use and is very good at
what it does. MMonit is a management console that can centralize the
monitoring of all servers. Ganglia is great for metrics; Flickr can vouch for
that one.
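The log-counting approach in item 1 can be sketched roughly like this (hypothetical; it assumes Apache's default %t timestamp format and that gmetric is on the PATH):

```python
import subprocess
from datetime import datetime, timedelta

def count_requests(log_lines, minute):
    """Count access-log lines stamped within the given minute.

    Assumes timestamps like [10/Jul/2010:13:55:01 +0000].
    """
    stamp = minute.strftime("[%d/%b/%Y:%H:%M")
    return sum(1 for line in log_lines if stamp in line)

def report(logfile="/var/log/apache2/access.log"):
    prev_minute = datetime.now() - timedelta(minutes=1)
    with open(logfile) as f:
        n = count_requests(f, prev_minute)
    # Push the per-minute count into Ganglia.
    subprocess.call(["gmetric", "--name", "req_per_min",
                     "--value", str(n), "--type", "uint32"])
```

Run report() from cron once a minute; it avoids touching mod_status entirely, which is the point made above.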

------
spooneybarger
Our primary means of monitoring is to make full-page requests to different parts
of the application every 5 minutes and check the results. We have a historical
record from multiple locations around the world that we can use to see if
changes were made that had a negative impact as well as automated alerts when
the full page load takes longer than expected.

We then have secondary monitoring that allows us to isolate issues within
different areas if primary monitoring indicates a problem.

------
wooster
For stats, I use Ganglia. The newest versions, which sadly have to be
installed from source on most Linux variants, have support for Python modules.
So, in addition to the out of the box stuff, I monitor a whole bunch of other
stuff related to how my applications are doing (signups, item counts, etc).

For daemon monitoring and watchdog-ing, I use monit. It emails me if there are
problems with the server, and can automatically restart processes if they fail
or get out of control.
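As a flavor of what that looks like, a hypothetical monitrc fragment (process names, paths, and thresholds are all invented):

```
# Restart Apache if it dies, hogs resources, or stops answering HTTP.
check process apache with pidfile /var/run/apache2.pid
    start program = "/etc/init.d/apache2 start"
    stop program  = "/etc/init.d/apache2 stop"
    if cpu > 80% for 5 cycles then restart
    if totalmem > 500 MB then restart
    if failed host localhost port 80 protocol http then restart

# Email on system-level trouble.
check system myserver
    if loadavg (5min) > 4 then alert
    if memory usage > 85% then alert
```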

------
ianmcgowan
It's in early development (pre-alpha), but <http://www.leemba.com/> looks
promising.

~~~
mrj
Check out the demo (it's currently on a virtual host so it might get destroyed
:-)

<http://demo.leemba.com/>

------
imagetic
<http://newrelic.com> has been pretty critical, but we're a Rails only shop.

~~~
briandoll
+1 for New Relic

I wouldn't consider building a production Rails app without it.

~~~
boundlessdreamz
I've been testing Scout as well. It was easy to set up. The interface could
use a bit of work though.

------
rajuvegesna
If you are looking for commercial solutions, there are a ton of them, like
AppManager (appmanager.com). But then, some of these solutions monitor what
you expose from your application.

I don't know if there are any hosted solutions though.

------
JangoSteve
For all of my Rails apps, I use a combination of Munin (for trending and
profiling over time), NewRelic (for daily and weekly profiling and
optimization) and AreMySitesUp (for 15-min pinging).

------
doriang
ScriptCanary - monitors JavaScript errors

------
fseek
I would say all of them. Generally top + atop + the log files should be
enough.

