perl

mixmastamyk · on Feb 21, 2024

Hah, was imagining some kind of organization breakthrough… “we wrote scripts,” not what I expected.

ChuckMcM · on Feb 21, 2024

Ah but it's really encapsulating a decade plus of running clusters on limited resources into scripts that automated nearly all of the detective work on fault analysis, resulting in very clearly signaling of problems. Combined with good process for managing "microfaults" (Greg's term not mine) led to a well instrumented (only kept the data that was important for diagnosing the faults) and robust in the presence of nearly any failure.