

Failure Friday: How We Ensure PagerDuty is Always Reliable - DougBarth
http://blog.pagerduty.com/2013/11/failure-friday-at-pagerduty/

======
jedberg
I posted this on the blog but I thought I'd repeat it here:

The Simian Army isn't AWS-only. :) Some of it runs on other stacks.

And the best part is, it is open source! So if you wanted to leverage the
Simian Army, it wouldn't be that hard to modify it to run on whatever stack
you want and then submit the changes back. :)

------
teh_klev
We just started using PagerDuty to deliver our Nagios alerts to landlines and
mobile phones after losing confidence in Vodafone's pager network.

The other thing we like is the integration with HipChat to deliver alerts into
our NOC chat room.

Overall we've been quite impressed... we'll be more impressed if you folks run
into actual trouble but we still get our alerts :)

~~~
ultrasaurus
That's great to hear (HipChat integration is possibly my favourite feature).

We do occasionally post about trouble that we've survived:
[http://blog.pagerduty.com/2012/07/a-utc-leap-second-vs-
derec...](http://blog.pagerduty.com/2012/07/a-utc-leap-second-vs-derecho/)
caused us some mild stress but no missed alerts.

------
mjallday
Anecdotal, I know, but PagerDuty is the only service we rely on that has
yet to go down on us. These guys are solid!

I like that tip on how to simulate a slow network too.

------
kapitalx
My first impression from the title was that this is a post-mortem for an
actual failure on Friday. But after reading your post the title made more
sense ;)

Great post!

------
iLoch
It's Wednesday!

