

Facebook App Outage Postmortem - lacker
https://developers.facebook.com/blog/post/2013/08/15/summary-of-the-august-13th-app-outage/

======
WestCoastJustin
Compare this with Google's "API infrastructure outage incident report" from
earlier this year [1], and you'll notice many headings, like Summary, Timeline,
Root Cause, Resolution and Recovery, and Corrective and Preventative Measures,
each with detailed times and numbers.

Maybe we (the IT community) need a framework for incident reports or
postmortems, or just use Google's as a model?

[1] http://googledevelopers.blogspot.ca/2013/05/google-api-infrastructure-outage_3.html

~~~
slipperyp
I doubt this is the internal audit of the outage event. At least I hope it
isn't, but I could be wrong. The template from Google isn't super uncommon and
I bet most organizations that really want to drill in and understand root
causes and prevent recurrence use something very similar to it (if the
organization is serious about uptime and if they're successful at reducing
recurrence).

It doesn't seem crazy to me that Facebook's publicly facing summary of this is
as casual as it seems to be. They owned up to breaking their platform and
indicate they're taking measures to not do it again. But if the person who's
internally accountable for analyzing this and preventing recurrence told me
"we're building better tools" without any specifics about those tools, who's
accountable for them, or the timeline they anticipate putting them in place,
I'd say they should pack their bags, so I bet there's a more detailed plan
internally. I'm also not a Facebook app developer, though, and if I had any
revenue depending on not being shut down like this, I might be more frustrated
with either a) this poor level of transparency (giving them the benefit of the
doubt) or b) this poor depth of analysis.

~~~
mayank
> At least I hope it isn't, but I could be wrong.

I used to work at Facebook, and this is most definitely not the internal
audit. A lot of Facebook engineers are former Googlers, and bring a lot of the
culture and practices with them. You can rest assured that people are
hunkering down in a conference room as we speak.

That said, Google's postmortems are a thing of awe and distributed widely
within the company.

------
Pxtl
Facebook has some of the best uptime in the industry, so they're a company
that I'm inclined to believe when they post stuff like this.

~~~
cheald
Compared to whom, exactly? The unreliability of Facebook's developer platform
is practically a meme now.

------
techscruggs
On an orthogonal note: seeing an image of a man smoking a cigarette and
drinking a beer next to the post-mortem doesn't exactly inspire confidence.
This really seems to be pushing the limits of professionalism.

~~~
eugenez
If it makes you feel any better, that photo of me is from our holiday party -
and the 'cigarette' is a prop (I do not smoke). The manhattan in the other
hand was real and delicious.

------
dtx
This is anything but a "post-mortem".

------
bowlofpetunias
I'm tempted to ask how the production data differed from the data in the test
environment on which they ran these potentially destructive tools first, but
I'm afraid that would be interpreted as rhetorical sarcasm.

Seriously though, how on earth do you even get the idea to initiate such an
operation without a) testing it against realistic data, and b) doing a dry run
against the live data first before you decide to pull the trigger and start
terminating applications?

And to top it all off, there was no rollback scenario, and the restore process
was buggy, which means it wasn't tested properly either.

I'm all for "move fast and break things", but this is a complete joke.

The organizational chaos that leads talented engineers to operate this way
must be a nightmare.

~~~
afriesh123
Yet the market cap of the company is currently $111 billion. There's a lesson
for us (or an opportunity) in there somewhere.

------
Ricapar
And nothing of value was lost.

