Ah but it's really encapsulating a decade plus of running clusters on limited resources into scripts that automated nearly all of the detective work on fault analysis, resulting in very clearly signaling of problems. Combined with good process for managing "microfaults" (Greg's term not mine) led to a well instrumented (only kept the data that was important for diagnosing the faults) and robust in the presence of nearly any failure.