
When you get an alert, you have to first understand the alert, and then you have to figure out what to do about it. The majority of alerts, when people don't craft them according to a standard/policy, look like this:

  Subject: Disk usage high
  Priority: High
  Message: 
    There is a problem in cluster ABC.
    Disk utilization above 90%.
    Host 1.2.3.4.
It's a pain in the ass to figure out what is actually affected and why it's happening, and then to track down some kind of runbook that describes how to fix this specific case (because it may vary from customer to customer, not to mention project to project). This is usually the state of alerts until a single person (who isn't a manager; managers hate cleaning up inefficiencies) gets so sick and tired of it that they take the weekend to overhaul one alert at a time to provide better insight into what is going on and how to fix it. Any docs written for those alerts are never updated by anyone but this lone individual.

Providing a link to a runbook makes resolving issues a lot faster. It's even better if the link is to a Wiki page, so you can edit it if the runbook isn't up to date.
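
For what it's worth, here is a rough sketch of what a more useful version of the alert above could look like when generated from code. Everything in it is an assumption made for illustration: the `Alert` fields, the `format_alert` helper, the impact text, and the wiki URL are all hypothetical, not any particular alerting tool's API.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert record; field names are illustrative only."""
    subject: str
    priority: str
    cluster: str
    host: str
    condition: str    # what tripped the alert
    impact: str       # what is actually affected
    runbook_url: str  # link to an editable wiki page (assumed URL)


def format_alert(alert: Alert) -> str:
    """Render the alert so the reader sees what, impact, and runbook."""
    lines = [
        f"Subject: {alert.subject}",
        f"Priority: {alert.priority}",
        "Message:",
        f"  What: {alert.condition} on host {alert.host} (cluster {alert.cluster}).",
        f"  Impact: {alert.impact}",
        f"  Runbook: {alert.runbook_url}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    example = Alert(
        subject="Disk usage high",
        priority="High",
        cluster="ABC",
        host="1.2.3.4",
        condition="Disk utilization above 90%",
        impact="Placeholder: name the service or customer that breaks if the disk fills.",
        runbook_url="https://wiki.example.com/runbooks/disk-usage-high",
    )
    print(format_alert(example))
```

The point is only that the impact and the runbook link travel with the alert itself, so whoever gets paged doesn't have to go hunting for them.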



