

Summary of the December 24, 2012 Amazon ELB Service Event - rbc
https://aws.amazon.com/message/680587/

======
rbc
Here's my summary of their article. The whole event occurred because a
developer ran an ELB purge job, thinking they were purging a non-production
ELB meta-data database. It turned out that the configuration they were using
purged production ELB meta-data instead. As they say, no good deed goes
unpunished. The missing data caused strange errors that confounded the ELB
technical team and delayed recovery.

The rest of the Amazon article details the recovery process and the
after-action items. Apparently, they had to attempt recovery at least twice,
because the first attempt failed. They first had to figure out how to recover
the ELB meta-data, since this database apparently had neither a working manual
procedure nor automation for recovery.

One key after-action item deals with production access and is worth noting.
Privileged access to production was being provided to a small team of
developers working on ELB automation projects. The privileged access was
"persistent" and didn't require per-access approval. On Christmas Eve, that
was a problem. Amazon is promising that each privileged access now requires
approval. They also claim that future recoveries will be faster, because
they understand the recovery process better. As a final note, they apologized
for the outage. Live and learn, I suppose.

