If you've ever done pager/on-call duty, you're probably familiar with a tool like PagerDuty. PagerDuty collects email alerts from your monitoring tools and sends automated phone calls and SMS messages to whoever is currently on call. The app supports the usual amenities of an alerting system, such as retrying unanswered alerts, on-call rotations, and automatic escalation when an alert goes unacknowledged.
Many large tech companies like Google and Amazon have sophisticated in-house on-call management and alerting systems. We have tried to build something similar for small and medium-sized businesses running critical systems.
One of the big challenges in building PagerDuty was making it simple and intuitive to use. If you find the setup process (or any other part of the system) confusing, please let us know.
http://www.pagerduty.com
As an aside, I find that the best way to avoid recurring failures and reduce the need for a large operations staff is to put the people who build the system on call for when it fails. Your operations staff gets woken up when a server crashes or a hard drive fails, and your engineers get woken up when their code crashes in the middle of the night.
If you don't do this, management has to levy the cost of poor production code across departments. Putting engineers on call avoids that externality entirely: engineers and operations each deal directly with the impact of their own implementation choices.
Of course, this is ultimately a wash if you don't also adopt development practices that reduce the number of production-impacting bugs, rather than relying on engineers' reactive fixes to one-off issues.