

Lessons Learned from a Redis Outage at Yipit - vacanti
http://tech.yipit.com/2012/09/27/lessons-learned-from-a-redis-outage-at-yipit/

======
23david
If you guys are writing post-mortem blog posts due to running out of disk
space, your solution should really be to hire a sysadmin or ops-focused
engineer. Disk-related issues are among the easiest to diagnose, so take this
experience as a wake-up call that your current team is in over its head. If
you can't afford a sysadmin or don't want to bring that kind of talent in-
house, you can try using a hosted solution. But make sure to really test out
different hosted services before committing to one, since they can vary
tremendously in terms of quality and reliability.

------
tedchs
I have used Monit for years for basic server monitoring. It's a very tiny
daemon with a single, simple config file. Basically I can tell it "when disk
space exceeds X, or RAM exceeds Y, or CPU exceeds Z, or process identified by
pidfile foo.pid isn't running, or I can't ping something, email me". No
monitoring servers, no network polling, no SNMP, no monthly fees. Sounds like
five lines of Monit config would have saved these guys. See the config file
docs at <http://mmonit.com/monit/documentation/monit.html> .
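
For anyone curious what the "process identified by pidfile isn't running"
check boils down to, here is a rough Python sketch. It is not how Monit is
implemented, only an illustration of the idea on a POSIX system; the function
names and the example pidfile path are hypothetical.

```python
import os


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        # Non-positive PIDs address process groups, not a single process.
        return False
    try:
        # Signal 0 delivers nothing; the kernel only checks that the
        # target exists and that we may signal it.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def pidfile_process_running(pidfile: str) -> bool:
    """Read a PID from `pidfile` and report whether that process is alive."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or malformed pidfile counts as "not running".
        return False
    return pid_is_running(pid)
```

Something like `pidfile_process_running("/var/run/foo.pid")` run from cron
(the path is a placeholder) gives a crude version of that check. A stale
pidfile whose PID has been reused by another process would still fool it.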

------
bsg75
This issue is oddly similar to issues seen at a prior gig, where MSSQL and
MySQL transaction logs (replication bin logs for MySQL) consumed excess disk
space when large operations did not fully replicate (for various reasons), and
the log volume filled.

Monitoring helps, but unless your Ops staff knows what to do with a
misbehaving database (RDBMS or other), it falls on the DBA or equivalent.

------
jtreminio
I'm no server admin, but it seems to be a recurring theme where big issues are
narrowed down to disk space running out. Is there not something that can
automatically check this and send out alerts?

~~~
spullara
This was the only alert I wrote myself for my startup (the rest are powered by
@newrelic). Saved me many times. Usually only happens when some log goes out
of control unexpectedly.
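
A homegrown alert of this kind can be very small. The following is only a
sketch of the general idea, not the script described above: the 90% threshold
and the function names are assumptions, and `alert` stands in for whatever
actually sends the email or page.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem holding `path` that is used."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a total of zero.
        return 0.0
    return 100.0 * usage.used / usage.total


def check_disk(path: str = "/", threshold_percent: float = 90.0,
               alert=print) -> bool:
    """Call `alert` with a message if usage is at or above the threshold.

    Returns True when an alert was raised, False otherwise.
    """
    percent = disk_usage_percent(path)
    if percent >= threshold_percent:
        alert(f"Disk usage on {path} is {percent:.1f}% "
              f"(threshold {threshold_percent:.1f}%)")
        return True
    return False


if __name__ == "__main__":
    # Run from cron every few minutes; replace print with a real notifier.
    check_disk("/", 90.0)
```

The percentage here is used/total, so it can differ slightly from what `df`
reports on filesystems that reserve blocks for root.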

~~~
johnjones
I've also seen it with RabbitMQ persisting a lot of messages that needed to be
queued up.

------
matthewowen
Reading that site is like the way I imagine having cataracts must be.

Please, more contrast between text and background. It's like reading through a
haze.

