

Handling Human Error in the Datacenter - slackerIII
http://www.spiteful.com/2008/08/11/handling-human-error-in-the-datacenter/

======
lsc
A problem everywhere I've worked, especially in places where the rack is
'organically grown', is power cables getting accidentally pulled when other
servers are added or removed.

On all servers that I have physical access to, I use zipties on both ends of
the power cable. You need a knife to unplug anything. One problem, at least,
solved.

Generally, I categorize mistakes as 'mistakes of knowledge' (that is, I did
the wrong thing because I believed something that was incorrect) and
'mistakes of inattention' (where I knew it was the wrong thing to do, but I
wasn't paying attention and did it anyhow).

Generally, you don't make the same mistake of knowledge twice, so I don't
worry about them much. They happen, but they only happen once. Learning, we
call it.

Mistakes of inattention are much worse, in my opinion. Without further action,
I will almost certainly repeat a mistake of inattention.

The idea is that every time you make a 'mistake of inattention', you put in
place a procedure that will prevent the mistake from happening again.

~~~
seiji
Never use zip ties in a data center. Use velcro strips instead.

I don't want people trying to pop zip ties with knives near my cables carrying
production traffic.

~~~
lsc
The difference between zipties and velcro is that velcro is best for
organizational binding, while zipties are structural and should largely be
considered permanent. You shouldn't cut a ziptie holding a cable that would
take down production any more than you should move a rack while production
servers in it are still running. If it is the sort of thing you might want to
move while the servers are running (say, ethernet, or a bundle of cables that
goes to more than one server) then, like you said, you should use velcro or
something else temporary.

------
sharjeel
Also, if your scripts have any dev mode features for testing (such as cleaning
up some database values and regenerating them, removing some files, etc.), make
sure that you are unable to execute them on production, or that some sort of
confirmation is required.

I had a script on my server that did clustering of stories from different news
sources. The script also had some test methods which deleted all the clustered
data and rebuilt it. I once accidentally ran the "cleanup method" on the prod
server, and that created a disaster because somehow a cascading deletion took
place. I had to refer to the replay log to get everything back, which took
hours of effort plus a lot of pressure. From then on I placed a check in each
of my scripts to get a confirmation twice before executing any such test method
on the prod server.
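
A rough sketch of what such a double-confirmation guard could look like. This
is not the actual check from the script described above; the `APP_ENV`
variable name, the `confirm_destructive` helper and the `rebuild_clusters`
action name are all made up for illustration. It assumes the environment is
marked by an environment variable and treats a missing value as production, so
it fails safe.

```python
import os


def confirm_destructive(action, env=None, ask=input):
    """Return True only if it is safe to run a destructive test helper.

    Outside production the action is allowed immediately. On production
    the operator must type the action name twice before it may run.
    """
    # APP_ENV is a hypothetical variable name; an unset value is treated
    # as production so that a misconfigured box is never assumed to be dev.
    if env is None:
        env = os.environ.get("APP_ENV", "production")
    if env != "production":
        return True
    for attempt in (1, 2):
        reply = ask(f"[{attempt}/2] Type '{action}' to run it on PRODUCTION: ")
        if reply.strip() != action:
            return False
    return True


# Example use inside a script (rebuild_clusters is a hypothetical helper):
#     if confirm_destructive("rebuild_clusters"):
#         rebuild_clusters()
```

Typing the action name, rather than just hitting "y", is a deliberate choice
here: it is harder to do on autopilot, which is the whole point of the check.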

------
sysop073
For his last suggestion about coloring the terminal background, it might be
easier to just color the name of the machine in the prompt.

e.g.: <http://i34.tinypic.com/5ecthx.jpg>
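
Something along these lines could generate such a prompt. It is only a sketch
and assumes bash; the role names, the role-to-color mapping and the
`bash_prompt` helper are invented for the example and are not taken from the
screenshot.

```python
import socket

# ANSI color codes (31 red, 33 yellow, 32 green). The mapping from machine
# role to color is an illustrative assumption.
COLORS = {"production": "31", "staging": "33", "development": "32"}


def bash_prompt(role, host=None):
    """Build a bash PS1 value with the host name colored by machine role."""
    host = host or socket.gethostname()
    code = COLORS.get(role, "0")  # unknown roles fall back to no color
    # \[ and \] tell bash the enclosed bytes are non-printing, so line
    # editing still computes the prompt width correctly.
    return rf"\u@\[\e[{code}m\]{host}\[\e[0m\]:\w\$ "


if __name__ == "__main__":
    # Print an assignment that could be pasted into ~/.bashrc on that host.
    print(f"PS1='{bash_prompt('production')}'")
```

Red for production makes it hard to miss which box a destructive command is
about to run on.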

~~~
slackerIII
Ah, changing the color of the prompt is a much better idea. Thanks.

------
a-priori
You should also look into software such as Puppet to reduce the amount of
manual administration you have to do.

