
Outage post-mortem - goronbjorn
https://tech.dropbox.com/2014/01/outage-post-mortem/
======
benjaminwootton
The great thing about Dropbox is that I didn't even notice, despite using my
files across the outage.

~~~
bluedino
Is that really a good thing?

What if you had finished up a document at home, thought it had synced when it
never did, and then showed up at a client's a few hours later wondering why
you still had the old version on your phone/laptop?

Wouldn't you have wanted Dropbox to let you know something was going on?

~~~
bradleyland
> Wouldn't you have wanted Dropbox to let you know something was going on?

This is such a hard thing to balance. I hate nagging notifications. Dropbox
makes it really easy to see what the service status is. I glance at my menu
bar/system tray icon, and look for the green check. If I don't see the green
check, I know my docs haven't synced.

~~~
hnriot
This is a glaring case of rose-tinted glasses! Of course you'd want to know
you were working on an out-of-date src/document/etc...

~~~
bradleyland
I already know that. The Dropbox icon is always within sight and provides
immediate feedback on the status of your Dropbox:

Green check - good to go!
Blue cycle - syncing
Blank icon - no connection

Not wanting a nag is not rose-tinted glasses. If there is an improvement to
be made, it would be in the last icon. Blank doesn't exactly scream "we're
down".
If the Dropbox client can't get a connection to the service, but it can see
that a network link is available, it should give some indication that it is
not connected, like some manner of red indicator.
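
A minimal sketch of that four-state logic, in Python (hypothetical names, not
Dropbox's actual client code):

      from enum import Enum

      class IconState(Enum):
          GREEN_CHECK = "synced"
          BLUE_CYCLE = "syncing"
          BLANK = "no network link at all"
          RED = "network up, service unreachable"  # the missing state

      def icon_state(network_up, service_reachable, pending_changes):
          if not network_up:
              return IconState.BLANK
          if not service_reachable:
              # The case the blank icon hides today: a link exists, but
              # the service can't be reached -- show something loud.
              return IconState.RED
          return IconState.BLUE_CYCLE if pending_changes else IconState.GREEN_CHECK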

------
cstrat
I found it hard to work out where to get the most up-to-date information on
the outage. I checked the blog, but the last message was their New Year
message. In the app and on the main website (mobile version) I couldn't see
anything...

Glad I read HN; otherwise I don't know how I would have come across this
information. =)

~~~
kordless
It would be nice to have a service that stepped in when shit like this
happened. I'd pay good money to have a tiger team appear out of thin air when
the shit hit the fan.

~~~
kordless
I mean customer-facing.

~~~
maroonblazer
Meaning a public relations function? That's what this boils down to,
ultimately: a person (or people) who knows the audience, understands what
concerns and questions they have, and provides timely answers to them.

I thought their response struck the appropriate level of detail. I don't care
to know the inner workings of their processes, but I'd like some indication
that _they_ care and that they're working on it. I got that from this.

------
imbriaco
This is, sadly, not a great post-mortem. They missed an opportunity for
goodwill. Having read it, I don't feel more confident in their level of
understanding or their ability to remediate the problems that led to the
outage. I know they have an excellent engineering and operations staff --
this post-mortem doesn't reinforce that, though.

A few of the things that jumped out at me after one reading:

1. The apology is the next-to-last sentence. That's burying the lede. I'd
like to see it far earlier, in the first two to three sentences.

2. The tone is overly clinical and lacks humanity. I suspect they felt that
it made them sound more authoritative and in control, but instead it comes
off somewhat robotic.

3. There's a mixture of too little and too much technical detail. It feels
like they couldn't decide who the audience was. There were technical tidbits
thrown out without any elaboration that lead to more questions than answers.

4. The remediations sound pretty weak. There's no discussion of the human
factors, like how the recovery process went, how this issue was missed in
testing, or what changes, if any, they think they should make to their
incident response process. At the very least I'd expect to see some
remediation around their during-outage communication process, since it has
pretty universally been considered poor.

It's not the worst post-mortem I've read, but they missed a few chances to
reassure customers.

~~~
RyanGWU82
This post was just an incident review for a technology audience. Dropbox
posted a separate apology to their users on their main blog:
[https://blog.dropbox.com/2014/01/back-up-and-running/](https://blog.dropbox.com/2014/01/back-up-and-running/).
The tone and detail seem totally appropriate since it ran concurrently with
the other post.

~~~
imbriaco
Ah, that's interesting. I wasn't aware of this post at all, thanks for
pointing it out.

------
eli
I thought Percona's XtraBackup already supported parallelized recovery using
binary logs:
[http://www.percona.com/doc/percona-xtrabackup/2.1/](http://www.percona.com/doc/percona-xtrabackup/2.1/)
(though I'll admit I've never tried it).

~~~
morgo
This is the same problem as to why slaves were single-threaded for so long:

Statement based replication (default) is tricky to apply in parallel, since
you can't easily figure out ordering dependencies.

In MySQL 5.6 replication is now parallel per-schema, and in MySQL 5.7 it will
be parallel intra-schema.
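
A rough sketch of the per-schema idea, in Python (hypothetical event format,
not MySQL's actual applier): one worker per schema keeps that schema's events
in order, while different schemas apply concurrently.

      import queue
      import threading

      def apply_event(event):
          # Stand-in for executing the replicated statement locally.
          print("apply to %s: %s" % (event["schema"], event["sql"]))

      def worker_loop(q):
          # Each worker drains one schema's queue serially, preserving
          # ordering within that schema.
          while True:
              event = q.get()
              if event is None:
                  break
              apply_event(event)

      workers = {}  # schema name -> its serial queue

      def dispatch(event):
          # Route each replicated event to its schema's worker; ordering
          # dependencies across schemas are assumed not to exist.
          q = workers.get(event["schema"])
          if q is None:
              q = queue.Queue()
              workers[event["schema"]] = q
              threading.Thread(target=worker_loop, args=(q,), daemon=True).start()
          q.put(event)

      # e.g. dispatch({"schema": "billing", "sql": "UPDATE invoices ..."})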

------
peterwwillis
They completely missed the real lessons from this outage: Automation is way
more fallible than human beings, and they didn't follow basic best practices
to stage and test production maintenance.

Humans have a lot going for them. They can think continuously and
dynamically. They can change their instructions on a whim. They can provide
custom solutions immediately. And they aren't limited to one way of solving a
problem.

When you have to perform a bunch of complicated changes in bulk, you might
think automating it would be the best way to ensure a uniform delivery of your
changes. But when a single thing is different about one environment,
everything is fucked. The only way to ensure a bunch of sensitive changes go
off without a hitch is to make it a manual process, even if you have to
supplement it with some automated processes along the way.

In this case, Dropbox allowed their site's reliability to be dictated by the
automated maintenance of production servers. It's always dangerous to make
changes on a production server. But what makes it worse here is that they
relied on a script to make sure everything happened perfectly, and didn't
double-check the results before putting it back into production.

They didn't even back up the old data in case they needed to quickly revert,
which should be a basic requirement of any production change! This isn't even
disaster recovery; this is production deployment 101. How they allowed this
upgrade to affect the production site is just crazy to me.
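
For what it's worth, the snapshot-then-verify pattern is cheap to script; a
minimal sketch in Python, with hypothetical paths and a hypothetical smoke
test standing in for "double-check the results":

      import shutil
      import subprocess
      import sys

      DATA_DIR = "/var/lib/app"                    # hypothetical paths
      BACKUP_DIR = "/var/backups/app.pre-upgrade"

      # Back up the old data first, so there is a quick revert path.
      shutil.copytree(DATA_DIR, BACKUP_DIR)

      # Apply the change (hypothetical upgrade script).
      subprocess.run(["/usr/local/bin/apply-upgrade"], check=True)

      # Double-check the results before the box rejoins production:
      # a hypothetical smoke test that must exit 0.
      if subprocess.run(["/usr/local/bin/smoke-test"]).returncode != 0:
          shutil.rmtree(DATA_DIR)
          shutil.move(BACKUP_DIR, DATA_DIR)        # revert to the backup
          sys.exit("upgrade failed verification; reverted")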

------
weisser
> The service was back up and running about three hours later, with core
> service fully restored by 4:40 PM PT on Sunday.

I understand these things happen, but I didn't have anything working at all
until Sunday (EST). I'm just happy it's back.

That said, what do you use to back up your Dropbox? I recently signed up for
Bitcasa, but my Dropbox folder had not been fully uploaded by the time
Dropbox stopped working.

------
nodesocket
[http://status.dropbox.com](http://status.dropbox.com) should really be
better and should be the primary portal for updates. We use StatusPage.io,
which is awesome: [http://status.commando.io](http://status.commando.io).

------
enscr
I couldn't upload files for more than a day; the client was stuck at
"connecting". I wish they had said upfront on Twitter or the blog that they
were still working on it.

~~~
jackgavigan
About ten years ago, I had a minor epiphany while working on a system that
used a standard client/service architecture. We were having "stability
issues" with the back-end (translation: it was going down more often than a
two-dollar hooker). If the back-end service wasn't responding, the GUI just
hung until it timed out, with no indication to the end user that there was a
problem (and because of the nature of the system, the timeout was relatively
lengthy).

This caused a certain amount of consternation amongst our users (because they
were losing money as a result), who would have far preferred to be told that
the system wasn't working than to have to wait for a timeout. So I got the
GUI developers to add a little bit of logic along the lines of:

    
    
      # roughly, in Python (hypothetical GUI helpers):
      if response.startswith("ERROR_MESSAGE"):
          show_error_dialog(response[len("ERROR_MESSAGE"):])  # display the ensuing text
      else:
          handle_response(response)  # continue as normal
    

We then knocked up a simple little service (using netcat, if I recall
correctly) that squirted the contents of a text file back in response to any
request, and echo >>'d the username supplied with the request into a text
file. From then on, if the back-end service had a problem, we would redirect
the load balancers to point to the mini-service so that any users would get a
nice friendly error message (that we could edit/update) telling them that
there was a problem.

When the outage was over and we'd confirmed everything was working properly
again, we'd simply re-point the load balancers back to the actual service
IPs/ports, grab the file with all the usernames, and email/call them to let
them know that the service was back up. (We had planned to build more logic
into the GUI and add an "Alert me when the system is back up" button to the
error dialog, which would cause the GUI to automatically/silently re-try
every X minutes and alert the user with a pop-up when full service was
restored, but we sorted out the stability problems before we got a chance to
implement that.)
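
The whole mini-service fits in a dozen lines of Python, too; a hypothetical
stand-in for the netcat version, with made-up file names and port:

      import socketserver

      class OutageHandler(socketserver.StreamRequestHandler):
          def handle(self):
              # Log the first line of the request (where the username
              # arrived) so we can follow up once the service is back.
              first_line = self.rfile.readline().decode(errors="replace").strip()
              with open("seen-users.log", "a") as log:
                  log.write(first_line + "\n")
              # Squirt back the (editable) outage message, prefixed so the
              # GUI's ERROR_MESSAGE branch displays it.
              with open("outage-message.txt", "rb") as msg:
                  self.wfile.write(b"ERROR_MESSAGE " + msg.read())

      server = socketserver.TCPServer(("0.0.0.0", 9000), OutageHandler)
      server.serve_forever()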

Maybe Dropbox should implement something similar so that, instead of being
stuck at "connecting", users get an error message.

------
neals
Avid Dropbox user here; I worked throughout the weekend and didn't know they
had a problem until now.

------
kostyk
It's good that they share this with everyone.

~~~
yachtintransit
I completely disagree.

~~~
chris_wot
Why?

------
yachtintransit
Agreed; give us the story. What was the command in the script that failed? To
err is human; to blog honestly about it is a story I want to read. There is
no room for fear in good content. Show the true story.

