Microsoft 1 -> ../../Organization/Microsoft/Outage 1
> Its intent is to address issues involving risks to the public in the use of computers. As such, it is necessarily concerned with whether/how critical requirements for human safety, reliability, fault tolerance, security, privacy, integrity, and guaranteed service (among others) can be met (in some cases all at the same time), and how the attempted fulfillment or ignorance of those requirements may imply risks to the public. We will presumably explore both deficiencies in existing systems and techniques for developing better computer systems -- as well as the implications of using computer systems in highly critical environments.
The performance tanked because the working set now required a disk hit.
Each, and every, query, required a disk hit.
That many IOs/sec, irrespective of the size could be enough.
My outage anxiety is reducing, even if contingencies are in place.
I think post-mortems are a huge opportunity to learn about resiliency and common system and engineer errors, I know I grew as an engineer with each of those.
Thanks for sharing!