

Ask HN: Sysadmin uptime gotchas? - imp

Would anyone like to share any semi-rare sysadmin problems that caused downtime for a website? I recently had two problems I didn't anticipate, both of which caused some downtime. It's for a website with 700 req/sec, LAMP setup, 1 web server, 1 DB server on EC2.

* Sporadic mysql_connect errors at high-traffic periods. The cause was some combination of hitting the max open files limit and the IP local port range on the web server (http://www.tigase.org/content/linux-settings-high-load-systems)

* All database connections from the web server were refused because the database hit the max_connect_errors setting in MySQL, which is only 10. Flushing hosts and increasing the setting a bit fixed it.

Does anyone else have other "black swan" events that they could share to help us soon-to-be-great sysadmins? We don't have to learn everything the hard way :)
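
For the first problem above, a minimal sketch of how one might inspect both limits on a Linux web server, assuming Python is available there. The helper names (parse_port_range, ephemeral_port_count, open_file_limits) are made up for illustration. It only reports the current values and does not change anything.

```python
import resource
from pathlib import Path

# Linux-specific location of the local (ephemeral) port range setting.
PORT_RANGE_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")


def parse_port_range(text):
    """Parse the two whitespace-separated integers found in ip_local_port_range."""
    low, high = (int(part) for part in text.split())
    return low, high


def ephemeral_port_count(low, high):
    """Number of local ports available for outgoing connections (inclusive range)."""
    return high - low + 1


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open files: soft={soft} hard={hard}")
    if PORT_RANGE_FILE.exists():
        low, high = parse_port_range(PORT_RANGE_FILE.read_text())
        print(f"local ports: {low}-{high} ({ephemeral_port_count(low, high)} usable)")
```

Note that the open-files limit is per process, so the number that matters is the one seen by the web server process itself, which may differ from what a login shell reports.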
======
spooneybarger
This is an obvious one but important: make sure you have log rotation set
up. I had an admin who was in a hurry when doing an install and forgot to
set up log rotation on a heavily used piece of software, which eventually led
to the hard drive filling up. A low-space warning email should have been
sent and log rotation should have been set up, but because someone was
hurrying, neither happened.
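
A low-space check of the kind described above could look roughly like this. It is only a sketch: the 10% threshold and the function names are assumptions for illustration, and it prints a warning where a real setup would send the email or page someone.

```python
import shutil

# Hypothetical threshold: warn when less than 10% of the disk is free.
MIN_FREE_FRACTION = 0.10


def is_low_on_space(total, free, min_free_fraction=MIN_FREE_FRACTION):
    """Return True when the free fraction of the disk is below the threshold."""
    return free / total < min_free_fraction


def check_path(path, min_free_fraction=MIN_FREE_FRACTION):
    """Check the filesystem containing `path` and report whether it is low on space."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    low = is_low_on_space(usage.total, usage.free, min_free_fraction)
    if low:
        percent_free = 100 * usage.free / usage.total
        print(f"WARNING: {path} has only {percent_free:.1f}% free")
    return low


if __name__ == "__main__":
    check_path("/")
```

Run from cron every few minutes, something like this catches a runaway log before the disk is actually full, even when rotation was forgotten.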

Not really a black swan, but I've found the majority of errors come from
people rushing and not remembering to do everything because of workload.

Takeaway: checklists are a very good thing to have.

~~~
imp
Yeah, that's a good one too. I learned that lesson the hard way a while back.
Checklists are a great idea.

