
Here’s what happens when Heroku goes down - craigkerstiens
http://gigaom.com/cloud/heroku-exec-takes-us-behind-the-scenes-when-clouds-fail/?utm_source=dlvr.it&utm_medium=twitter
======
blantonl
One way to reduce outages and bugs would be an incentive plan that pays
_bonuses_ to on-call staff each time they fix a problem during their on-call
shift.

This would have two effects:

- Management would strongly encourage the design and implementation of
systems that fail less, since fewer failures mean fewer bonus payouts.

- Employees would be more willing to take part in the on-call process.

Imagine one part of your workforce eagerly waiting to fix an impending failure
while another part is just as eagerly making sure those guys don't get paid
a bonus.

And you could tie it all together with bonuses for everyone when you meet
certain performance levels.

~~~
mobileman
And employees would also put bugs in to reap the reward.

~~~
blantonl
I certainly don't disagree that there are opportunities to game the "system,"
but in startup environments there are lots of people keeping a close eye on
development and production environments. Someone gaming it will quickly be
exposed.

Just to clarify, I'm not recommending this approach for large corporate
environments, where layers upon layers of management and developers could
easily derail what I am suggesting.

But for the startup, it is worthy of consideration.

------
mef

      In most cases the pages, which arrive about two or three
      times during a 24-hour on-call period, require the engineer 
      to take down the problematic instance and restart it.
    

I've never held a position that required a pager, but I always assumed that
pages to on-call support people were for emergencies only. It sounds like
they're paging for non-emergencies that just need to get looked at at some
point; wouldn't an email work just as well?

~~~
joevandyk
When an instance is "degraded", they need to start another instance quickly,
or else the apps hosted on that instance will be down for good. So it's an
emergency.

I think they don't have enough confidence in their detection of "degraded"
instances to do it automatically, so it requires human intervention.
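
To make that trade-off concrete, here is a minimal sketch (in Python) of a
policy that pages a human on weak evidence of degradation and only replaces an
instance automatically on strong evidence. The class name, the window size and
both thresholds are hypothetical illustrations, not anything Heroku has
described; the point is only that "page" and "replace" can be separate
confidence levels.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InstanceMonitor:
    """Hypothetical health-check policy for a single instance."""

    window: int = 5          # number of recent checks to remember (assumed)
    page_threshold: int = 2  # failures in window that trigger a page (assumed)
    auto_threshold: int = 5  # failures in window that trigger auto-replace (assumed)
    results: deque = field(default_factory=deque)

    def record(self, healthy: bool) -> str:
        """Store one health-check result and return 'ok', 'page' or 'replace'."""
        self.results.append(healthy)
        if len(self.results) > self.window:
            self.results.popleft()  # forget the oldest check

        failures = sum(1 for ok in self.results if not ok)
        if failures >= self.auto_threshold:
            return "replace"  # strong evidence: act without a human
        if failures >= self.page_threshold:
            return "page"     # weak evidence: ask the on-call engineer
        return "ok"


monitor = InstanceMonitor()
checks = [True, False, False, False, False, False]
actions = [monitor.record(healthy) for healthy in checks]
print(actions)
```

With these made-up thresholds, a single failed check does nothing, a few
failures page the engineer, and only a full window of failures replaces the
instance without anyone being woken up.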

