

I Had Downtime Today. Here’s What I’m Doing About It. - soundsop
http://www.kalzumeus.com/2010/02/21/i-had-downtime-today-heres-what-im-doing-about-it/

======
rm-rf
I see two threads here. (1) How do you communicate with affected customers,
and (2) Root cause analysis, ruthlessly done, with no BS and no excuses.

On (1), this appears to have been an outage that only a subset would have
noticed, so the direct message to affected users looks like a good way to
communicate. I've seen other ways of handling that, both 'broadcast' and
'denial', and neither is good.

On (2), the OP's use of root cause and the unambiguous identification of
cause/effect/remediation is something that others should emulate.

If you cannot identify cause of failure and build
mitigation/detection/prevention for every failure, every time you fail, you
are destined to spend the rest of your career reacting to broken crap.
Determining root cause of every single failure is lots of work, but so is
reacting to broken crap every day and night. That gets old after a while.

~~~
patio11
Thanks, the praise means a lot to me.

One factor I forgot to mention: prevention can be scheduled, incident response
cannot. This is really a quality of life issue for me: if I can pick the hour
I do the preventative work in, then I can pick an hour where the work will
have minimal impact on the parts of my life which happen away from a computer.
(All appearances to the contrary, I do occasionally get away from a computer.
Not very much these last few weeks, granted, but soon hopefully quite a bit.)

However, outages requiring my immediate attention will probably happen when I
want to be doing something else -- like sleeping, or spending time with
family, or wooing a young lady, or eating unhealthy food and singing karaoke
songs with the guys. Obviously, it's in my best interest to minimize that, even
more than it is generically in my interest to maximize my productivity.

~~~
bemmu
I've noticed you haven't usually written much about your personal life. But
I'm curious, I'll be moving to Tokushima this August, do you live anywhere
near there?

~~~
patio11
I live in the general vicinity of Nagoya. Tokushima is approximately 3.5 hours
from Nagoya by the fastest available public transportation, on another island.

------
bendtheblock
Writing up details and actions after an outage like this is a good practice
for improving the perception of a service. The complement to taking reactive
action like this (e.g. informing users and making underlying tech improvements)
is to try to show uptime in a passive way that the users register. My old boss -
before I had a startup - would often use the analogy of the London
Underground, which for a long time was perceived as a poor service, until they
introduced regular status updates on all lines in each station. This made
people register all the times that the service was working for them (and many
others, by showing the status of all lines), so that when there was a problem
they saw it in context as a small proportion compared to 'uptime'. Not sure
how you could apply that in this example, but it might resonate with more
'infrastructure'-like services such as web hosts.

~~~
patio11
I am sympathetic to the overall thrust of this, but I posted on my blog (which
vanishingly few customers read) rather than my main site for a reason. If your
toaster can't toast, you don't want to hear why your toaster can't toast, you
just want it to toast toast. This describes the relationship of almost all of my
users with their computers, the Internet, and Bingo Card Creator.

While I want to apologize to users who got delayed from getting back to their
lives because their toaster was on the fritz, I don't want to tell anybody
else that toasters sometimes can't toast toast. It needlessly complicates
their relationship with their toaster: they have no relationship with their
toaster, and that is how they bloody like it.

~~~
bendtheblock
Yes that's a good point - I think in most cases you're probably right and
simplicity (e.g. attempting to not show irrelevant information) is more
important than telling them that their toaster is toasting.

------
cglee
We recently wrote a blog post about monitoring DJ workers with Webmin that you
may find helpful: [http://blog.activeinterview.com/post/2010/01/28/use-
webmin-t...](http://blog.activeinterview.com/post/2010/01/28/use-webmin-to-
monitor-rails-delayed-job-or-anything/)

~~~
patio11
Thanks for the suggestion and the blog post. I'll look into whether adding
Webmin to the server makes sense. ("Every additional moving part is a future
source of failure." -- popular maxim locally.)

------
oomkiller
God really is just a horrible way to manage processes. Either use monit or
bluepill, they are MUCH more reliable.

