The answer is most certainly "both". The system can be used to identify problems you didn't know you had (generally by sending an email or a page). Once it has flagged a problem, you can have it pull the affected instances out of our discovery service but keep them alive so you can ssh in and triage. Once you understand the problem, you can have it mop up by terminating the instances displaying the behavior while you work on a fix.
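To make those three stages concrete, here is a rough sketch of the escalation as a decision function. This is not our actual code: `Action`, `choose_action`, and both flags are hypothetical names made up for illustration.

```python
from enum import Enum


class Action(Enum):
    ALERT = "alert"            # email / page a human
    QUARANTINE = "quarantine"  # pull from discovery, keep alive for ssh triage
    TERMINATE = "terminate"    # mop up while a fix is in progress


def choose_action(problem_known, root_cause_known):
    """Pick a response based on how well the problem is understood (hypothetical policy)."""
    if not problem_known:
        # New, unexplained behavior: tell someone, touch nothing.
        return Action.ALERT
    if not root_cause_known:
        # Known symptom, unknown cause: stop serving traffic but preserve the evidence.
        return Action.QUARANTINE
    # Cause understood: bad instances are safe to replace.
    return Action.TERMINATE
```

The point of the ordering is that each step is more destructive than the last, so you only move to the next one after learning enough to justify it.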

We throttle automatic terminations so that the system can't drop an entire cluster at once. It has yet to cause an outage, fingers crossed!
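For a sense of what such a throttle can look like, here is a minimal sketch, assuming a sliding time window and a cap expressed as a fraction of cluster size. The class name, the 10% default, and the 10-minute window are all assumptions for illustration, not our real settings.

```python
import time
from collections import deque


class TerminationThrottle:
    """Caps how many instances of one cluster may be terminated per time window."""

    def __init__(self, max_fraction=0.1, window_seconds=600.0, clock=time.monotonic):
        self.max_fraction = max_fraction      # assumed cap: share of the cluster per window
        self.window_seconds = window_seconds  # assumed window length
        self.clock = clock                    # injectable so the throttle can be tested
        self._recent = deque()                # timestamps of terminations still in the window

    def allow(self, cluster_size):
        """Return True and record the termination if it fits within the budget."""
        now = self.clock()
        # Forget terminations that have aged out of the window.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        # Always permit at least one termination so small clusters can still heal.
        budget = max(1, int(cluster_size * self.max_fraction))
        if len(self._recent) >= budget:
            return False
        self._recent.append(now)
        return True
```

A caller would check `throttle.allow(cluster_size)` before each automatic termination and fall back to alerting only when it returns False, so a bad detection rule degrades into noise instead of an outage.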

We generally run into two classes of errors: 1) Software bugs, which follow the process outlined above. 2) Issues with AWS, for example some virtual servers running on hardware that is experiencing a network issue. If those are terminated and replaced by the ASG, the new ones generally spin up on good hardware and we've avoided the issue. It's rare, but it does happen at our scale.

