
We use PagerDuty and couldn't be happier with the product.

Ops in a modern startup (based on my experience at Crittercism) is about 1/3 automation (deploys, backups, cronjobs, etc.), 1/3 monitoring, and 1/3 vendor/product evaluation (hosting, various consultants for things like database tuning).

The hardest part about monitoring isn't making the tool go off; it's (1) knowing when something is actually broken and (2) knowing who needs to be alerted when it is. "Tell the whole team" breeds an attitude of "this is someone else's problem," and it also prevents real work/progress from happening during incident response. Once your company gets beyond about 3-4 engineers, you have to get away from "all hands on deck" during an incident or your feature velocity is going to get destroyed.
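
To make (2) concrete, here's a minimal sketch of routing an alert to the owner of the failing service instead of paging everyone. This is hypothetical Python, not PagerDuty's actual API; the service names, ON_CALL map, and notify() helper are made-up stand-ins:

    # Map each service to its on-call owner instead of paging the whole team.
    ON_CALL = {
        "api": "alice@example.com",
        "ingest": "bob@example.com",
        "billing": "carol@example.com",
    }
    FALLBACK = "oncall-lead@example.com"  # escalation target for unowned services

    def notify(recipient: str, message: str) -> None:
        # Stand-in for whatever paging/notification integration you actually use.
        print(f"paging {recipient}: {message}")

    def route_alert(service: str, message: str) -> None:
        # Page only the owner of the failing service; escalate if nobody owns it.
        notify(ON_CALL.get(service, FALLBACK), f"[{service}] {message}")

    route_alert("ingest", "error rate above 5% for 10 minutes")

The point is just that every alert ends up with exactly one owner and a clear escalation path, rather than "tell the whole team."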

Also, as your company gets larger, you'll find that managing the communication around the incident is just as important as fixing the problem. Customers HATE being left in the dark, so it's important to figure out who needs to know things are broken (internally and externally) and how that's communicated.

Heroku did an excellent writeup on this topic recently: https://blog.heroku.com/archives/2014/5/9/incident-response-... -- even if you don't adopt the full system outlined there, at least ensure you're thinking about it, especially the communication part.



