Large-scale outages like this are often caused by the redundancy itself. The whole system is interlinked with automatic failover. Then it hits a corner case that was never engineered into the failure model, and you get a cascading failure where each node automatically starts bringing down other nodes. Basically, the lesson is: in highly interlinked systems you get highly interlinked failures.
And then, after a lot of angry words and finger-pointing, this new failure gets added to the failure model.
My personal takeaway, after chasing the long tail of automatic failover on a few projects, is that it is quite often better to drop a few 9's from your service goal, decouple some of the systems, and accept that some parts of the system may go down, as long as they do not bring everything else down with them.
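To make the coupled-versus-decoupled trade-off concrete, here is a toy sketch, not a model of any real outage. It assumes one particular cascade mechanism (a dead node's load gets split evenly across the survivors, and any node pushed over capacity dies too); the function name, node names, loads, and capacity are all made up for illustration.

```python
def surviving_nodes(loads, capacity, failed, failover):
    """Return the sorted ids of nodes still up after node `failed` dies.

    loads:    dict of node id -> current load (hypothetical units)
    capacity: maximum load any single node can carry (assumed equal for all)
    failover: if True, a dead node's load is split evenly across survivors;
              if False, that load is simply dropped (the decoupled case)
    """
    live = {n: load for n, load in loads.items() if n != failed}
    orphaned = loads[failed] if failover else 0.0
    while orphaned and live:
        share = orphaned / len(live)
        live = {n: load + share for n, load in live.items()}
        overloaded = [n for n, load in live.items() if load > capacity]
        # Overloaded nodes die and hand their entire load to whoever is left.
        orphaned = sum(live.pop(n) for n in overloaded)
    return sorted(live)


# Four nodes running at 80% of capacity (illustrative numbers only).
loads = {"a": 80, "b": 80, "c": 80, "d": 80}

coupled = surviving_nodes(loads, capacity=100, failed="a", failover=True)
decoupled = surviving_nodes(loads, capacity=100, failed="a", failover=False)

print(coupled)    # []
print(decoupled)  # ['b', 'c', 'd']
```

With failover on, losing one node takes out all four. With it off, you lose the traffic for "a" and nothing else. That is the "drop a few 9's" trade in miniature: the decoupled version accepts a partial outage so that it never turns into a total one.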