If you guys are writing post-mortem blog posts because you ran out of disk space, the real fix is to hire a sysadmin or ops-focused engineer. Disk-related issues are among the easiest to diagnose, so take this experience as a wake-up call that your current team is in over its head. If you can't afford a sysadmin or don't want to bring that kind of talent in-house, you can try a hosted solution. But make sure to really test out different hosted services before committing to one, since they vary tremendously in quality and reliability.
I have used Monit for years for basic server monitoring. It's a very tiny daemon with a single, simple config file. Basically I can tell it "when disk space exceeds X, or RAM exceeds Y, or CPU exceeds Z, or process identified by pidfile foo.pid isn't running, or I can't ping something, email me". No monitoring servers, no network polling, no SNMP, no monthly fees. Sounds like five lines of Monit config would have saved these guys. See the config file docs at http://mmonit.com/monit/documentation/monit.html .
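For the curious, the config is roughly this shape; the paths, pidfile, host address, and email below are placeholders, and the exact directive syntax should be checked against the docs for your Monit version:

    # /etc/monitrc (sketch; placeholders throughout)
    set mailserver localhost                  # assumes a local MTA is running
    set alert ops@example.com                 # where the alerts go

    check filesystem rootfs with path /
        if space usage > 80% then alert       # "disk space exceeds X"

    check system $HOST
        if memory usage > 85% then alert      # "RAM exceeds Y"
        if loadavg (5min) > 4 then alert      # load as a rough stand-in for "CPU exceeds Z"

    check process myapp with pidfile /var/run/myapp.pid
        # with no start/stop programs defined, Monit just alerts when the process is gone

    check host upstream with address upstream.example.com
        if failed icmp type echo count 3 with timeout 3 seconds then alert   # "can't ping something"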
This issue is oddly similar to ones seen at a prior gig, where MSSQL and MySQL transaction logs (replication binlogs, in MySQL's case) consumed excess disk space when large operations didn't fully replicate (for various reasons) and the log volume filled.
Monitoring helps, but unless your Ops staff knows what to do with a misbehaving database (RDBMS or other), it falls on the DBA or equivalent.
I'm no server admin, but it seems to be a recurring theme where big issues are narrowed down to disk space running out. Is there not something that can automatically check this and send out alerts?
There are a lot of solutions that range from "solve this immediate problem now with minimal work by me" to "solve this problem and a host of problems that I don't have now but will have in the future". The trick is figuring out where to be on the spectrum.
For example, perhaps the simplest solution would be to cron a script that checks 'df' output and sends an email as soon as you hit some reasonable threshold.
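For instance, a minimal sketch in Python (using shutil.disk_usage rather than parsing df, but same idea; the threshold, path, and addresses are placeholders, and it assumes a local MTA is accepting mail on the box):

    #!/usr/bin/env python3
    # Minimal disk-space check meant to run from cron, e.g.:
    #   */30 * * * * /usr/local/bin/check_disk.py
    import shutil
    import smtplib
    import socket
    from email.mime.text import MIMEText

    PATH = "/"              # filesystem to watch (placeholder)
    THRESHOLD = 80          # alert when usage exceeds this percentage
    TO = "ops@example.com"  # placeholder address

    usage = shutil.disk_usage(PATH)
    percent_used = 100.0 * usage.used / usage.total

    if percent_used > THRESHOLD:
        host = socket.gethostname()
        msg = MIMEText("%s: %s is %.1f%% full" % (host, PATH, percent_used))
        msg["Subject"] = "Disk space warning on %s" % host
        msg["From"] = "cron@" + host
        msg["To"] = TO
        # Assumes a local MTA (sendmail/postfix) is listening on localhost:25.
        smtplib.SMTP("localhost").sendmail(msg["From"], [TO], msg.as_string())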
More complex but significantly more powerful is running something like Nagios to monitor not only disk usage but a plethora of other system-level checks.
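A Nagios check is, at bottom, just a script that prints one status line and exits with the standard plugin codes (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN). A bare-bones disk plugin along those lines, with made-up thresholds:

    #!/usr/bin/env python3
    # Minimal Nagios-style disk check: prints a status line (with perfdata
    # after the "|") and exits with the standard plugin codes.
    # Path and thresholds are placeholders.
    import shutil
    import sys

    PATH = "/"
    WARN = 80   # percent
    CRIT = 90   # percent

    try:
        usage = shutil.disk_usage(PATH)
    except OSError as exc:
        print("DISK UNKNOWN - %s" % exc)
        sys.exit(3)

    pct = 100.0 * usage.used / usage.total
    detail = "%s at %.1f%% | used_pct=%.1f%%;%d;%d" % (PATH, pct, pct, WARN, CRIT)

    if pct >= CRIT:
        print("DISK CRITICAL - " + detail)
        sys.exit(2)
    elif pct >= WARN:
        print("DISK WARNING - " + detail)
        sys.exit(1)
    print("DISK OK - " + detail)
    sys.exit(0)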
Once that road is walked it's not a big leap to start monitoring the application itself.
Why stop there? If you've got your metrics system (like Graphite) up and running, you can pull in raw metrics and trend your disk usage over time. Write a script that pulls in the raw data (add rawData=true to your parameters in Graphite) and then set thresholds on that. Have Graphite take the standard deviation of your disk metric and now you're alerting not only on an absolute threshold, but monitoring for sudden spikes in activity.
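A rough sketch of that last idea, in Python, with a made-up Graphite host and metric name, and doing the mean/standard deviation client-side rather than via Graphite's own functions:

    #!/usr/bin/env python3
    # Pull raw datapoints for a disk metric from Graphite's render API
    # (rawData=true) and flag a sudden jump relative to the recent mean/stddev.
    # The Graphite URL and metric name are hypothetical.
    import statistics
    import urllib.request

    GRAPHITE = "http://graphite.example.com"
    TARGET = "servers.web01.disk.root.percent_used"   # hypothetical metric
    URL = "%s/render?target=%s&from=-6hours&rawData=true" % (GRAPHITE, TARGET)

    raw = urllib.request.urlopen(URL).read().decode()
    # rawData lines look like: "target,start,end,step|v1,v2,None,v3,..."
    values_part = raw.strip().splitlines()[0].split("|", 1)[1]
    points = [float(v) for v in values_part.split(",") if v not in ("None", "")]

    history, latest = points[:-1], points[-1]
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)

    # Absolute threshold plus a crude spike check (latest point > 3 sigma above the mean).
    if latest > 80 or latest > mean + 3 * stdev:
        print("ALERT: %s is at %.1f (mean %.1f, stdev %.1f)" % (TARGET, latest, mean, stdev))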
You may also very well be able to get "more complex" without your own infrastructure ... with the tradeoff being money and relying on 3rd party SaaS. There are pros and cons involved here.
Circle back for a second, though. Putting in a complex solution that gives you the kitchen sink requires time and money. Nagios and Graphite add a layer of complexity that may be totally overblown for your needs at the moment. SaaS might not fit the bill. Right now may NOT be the time to go all crazy. So start simple: get that cron job in place today, gain a little peace of mind, and then figure out what your next steps should be.
Yes, and everyone should be doing so. For EC2 Amazon provides sample CloudWatch scripts[1] that will report additional metrics, including storage space. All server monitoring tools and services can (and should) watch your disk space.
If you're not monitoring basics like disk utilization and RAM, you're just asking for unnecessary downtime.
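To make the CloudWatch suggestion concrete: once those sample scripts are publishing disk metrics, you can hang an alarm off one. A rough boto3 sketch, assuming the scripts' default System/Linux namespace and DiskSpaceUtilization metric; the SNS topic, instance id, filesystem, and mount path are placeholders:

    #!/usr/bin/env python3
    # Rough sketch: alarm on the DiskSpaceUtilization metric that Amazon's
    # sample CloudWatch monitoring scripts publish under System/Linux.
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="web01-root-disk-80pct",
        Namespace="System/Linux",
        MetricName="DiskSpaceUtilization",
        Dimensions=[
            {"Name": "InstanceId", "Value": "i-0123456789abcdef0"},
            {"Name": "MountPath", "Value": "/"},
            {"Name": "Filesystem", "Value": "/dev/xvda1"},
        ],
        Statistic="Average",
        Period=300,                     # 5-minute periods
        EvaluationPeriods=2,            # two consecutive breaches before alarming
        Threshold=80.0,                 # percent used
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
    )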
1. Set up an alert at a conservative usage threshold to make sure nothing like this can happen
2. See alert and know you have plenty of time to fix the issue
3. Get distracted
4. Disk space disaster
We use AppFirst for our monitoring alerts. One thing they don't support is sending recurring alerts while something is over a threshold. They only send when thresholds are crossed.
Right now we're experimenting with having PagerDuty ingest the AppFirst alerts and treat each one as an open issue.
Parsing the output of df in a cronjob and echoing an error message is a trivial thing to do. Run that cronjob every 30 minutes and configure mail correctly on your box.
This was the only alert I wrote myself for my startup (the rest are powered by @newrelic). It has saved me many times; it usually only fires when some log grows out of control unexpectedly.