Outage post-mortem (dropbox.com)
143 points by goronbjorn 1288 days ago | 41 comments



The great thing about DropBox is that I didn't even notice, despite using my files across the outage.


Is that really a good thing?

What if you had finished up a document at home, thought it synced but it never did, and then showed up at a client's a few hours later wondering why you still had the old version on your phone/laptop?

Wouldn't you have wanted DropBox to let you know something was going on?


> Wouldn't you have wanted DropBox to let you know something was going on?

This is such a hard thing to balance. I hate nagging notifications. Dropbox makes it really easy to see what the service status is. I glance at my menu bar/system tray icon, and look for the green check. If I don't see the green check, I know my docs haven't synced.


This is a glaring case of rose-tinted glasses! Of course you'd want to know you were working on an out-of-date src/document/etc...


I already know that. The Dropbox icon is always within sight and provides immediate feedback on the status of your Dropbox:

Green check - good to go!
Blue cycle - syncing
Blank icon - no connection

Not wanting a nag is not rose-tinted glasses. If there is an improvement to be made, it would be in the last icon. Blank doesn't exactly scream "we're down". If the Dropbox client can't get a connection to the service, but it can see that a network link is available, it should give some indication that it is not connected, like some manner of red indicator.
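
Something like this would be enough - a rough sketch in Python (the helper and icon names are made up for illustration; this is not Dropbox's actual client code):

  import socket

  def service_reachable(host="www.dropbox.com", port=443, timeout=3):
      """True if a TCP connection to the sync service succeeds."""
      try:
          with socket.create_connection((host, port), timeout=timeout):
              return True
      except OSError:
          return False

  def pick_tray_icon(network_link_up, fully_synced):
      if not network_link_up:
          return "blank"    # no network at all
      if not service_reachable():
          return "red"      # network is up but the service isn't answering
      return "green-check" if fully_synced else "blue-cycle"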


It refused to run for me, insisting that I relink my machine, and then took me to the website to do so, where that then gave an error. It was only this morning that I finally managed to get things running again.


I found it hard to work out where to get the most up-to-date information on the outage. I checked the blog, but the last message was their New Year message. In the app and on the main website (mobile version) I couldn't see anything...

Glad I read HN otherwise I don't know how I would have come across this information. =)


Would be nice to have a service that stepped in when shit like this happened. I'd pay good money to have a tiger team appear out of thin air when the shit hit the fan.


I just meant to say that it would be great if their mobile app had some way of notifying users of an issue. Rather than just showing a generic error when you try to upload files, it would have been good if the app had a banner that said they were experiencing issues and normal service would resume in x days... something.


I mean customer facing.


Meaning a public relations function? That's what this boils down to, ultimately: a person (or people) who knows the audience, understands what concerns and questions they have, and provides timely answers to them.

I thought their response struck the appropriate level of detail. I don't care to know the inner workings of their processes, but I'd like some indication that they care and that they're working on it. I got that from this.


This is, sadly, not a great post-mortem. They missed an opportunity for goodwill. I don't feel more confident in their level of understanding or ability to remediate the problems that led to it after having read it. I know they have an excellent engineering and operations staff -- this post-mortem doesn't reinforce that, though.

A few of the things that jumped out at me after one reading:

1. The apology is the next to last sentence. That's burying the lede. I'd like to see that far earlier, in the first two to three sentences.

2. The tone is overly clinical and lacks humanity. I suspect they felt that it made them sound more authoritative and in control, but instead it comes off somewhat robotic.

3. There's a mixture of too little and too much technical detail. It feels like they couldn't decide who the audience was. There were technical tidbits thrown out without any elaboration that led to more questions than answers.

4. The remediations sound pretty weak. There's no discussion of the human factors like how the recovery process went, how this issue was missed in testing, or what changes, if any, they think they should make to their incident response process. At the very least I'd expect to see some remediation around their during-outage communication process, since it has pretty universally been considered poor.

It's not the worst post-mortem I've read, but they missed a few chances to reassure customers.


This post was just an incident review for a technology audience. Dropbox posted a separate apology to their users on their main blog: https://blog.dropbox.com/2014/01/back-up-and-running/ . The tone and detail seem totally appropriate since it ran concurrently with the other post.


Ah, that's interesting. I wasn't aware of this post at all, thanks for pointing it out.


This is a silly fetishizing of the 'post-mortem'.

I will think no less of any company with solid technology that experiences a failure and puts honest effort into communicating a post-mortem explanation, which is exactly what happened here.

I will, though, lose some respect for people who quibble about perceived faux pas in the explanation, because that loses sight of what is actually important.


> There's no discussion of the human factors like how the recovery process went, how this issue was missed in testing, or what changes if any they think they should make to their incident response process.

Not every company is into that whiny startup blood-and-tears thing. Those "we(*) worked non-stop for the last 72 hours" posts often sound a bit desperate.

(*) And by "we", the PR people usually mean the engineers.


That's not at all what I was getting at. It's not about patting yourself on the back or trying to make the team look like heroes.

I was more wondering how the mechanics of their incident response processes were managed and whether they planned to make any changes as a result of the review of this incident. Technical remediations are all well and good, but organizational, cultural, and even procedural changes are often even more impactful after events like this.

For example:

Were they happy with the pace of communication during the outage? Do they think customers were updated frequently enough, too frequently, etc.? Any changes planned?

How did they handle incident fatigue? Did they have to go to shifts to manage the recovery? Did they already have this planned or was it done on the fly? Do they plan to build any procedures to handle similar long-running events in the future?


I don't see the need for any of that. What does it really matter that they had "incident fatigue"? I don't really care about their internal comms or escalation procedures. If I were a customer, I'd want to know what they are doing to mitigate a similar incident (which they answered), and an apology.

If I wanted a credit, or SLAs weren't met, then I'd talk directly to an account manager.


As I've said before, one blog post does not represent that team when it's written by someone tasked with the job of communicating with a wide variety of customers. My mom couldn't give two hoots about details. She wants to know why her 'spinny Dropbox thing' keeps spinning and whether she should upgrade or something. I deliver that news to her. This blog post delivers it to people who don't understand as well as most of us but better than my mom.

What would be an AWESOME idea is if Dropbox did a meetup to go through the gory details for us nerds. Now that would rock.

Kudos to the Dropbox team for working through the weekend fixing stuff. I spent the better part of the weekend nursing a barely two-year-old, dying Apple 27" Cinema Display back to life by disassembling it several times. I kept thinking to myself that I sure as hell was glad it wasn't me over at Dropbox HQ working on recovery instead.

Edit: I agree with your plea for emotion in the post. It could ease things a bit.


That's just it, though. This is the public face of the team that responded to that outage. It absolutely represents them. Now, whether it's a fair depiction or not is definitely a valid question.

Having written more than my fair share of these, I definitely understand the difficulty involved in choosing your audience and writing to them. That's a big part of the problem here: the audience is not clear. It dips into technical detail like the MySQL recovery process, but it doesn't go deep enough to satisfy a really technical audience, while still being too detailed for a non-technical one.

I have nothing but admiration for their team and the service they've built, but this post-mortem misses the mark.


> the audience is not clear.

Bingo. We need nerd updates.

BTW, we deserve this because enough of us use Dropbox for quite important things coding-wise.


I agree with many of your points and appreciate your technical assessment of the actual post-mortem aspect, but your first comment seems particularly nit-picky. It's a growing trend that when a company or person fucks up, we expect a big, grandiose, sobbing apology (and when they don't give one, we blow a gasket - a la Snapchat).

Now, I'm not saying that I don't expect companies to be forthright and take ownership of their mistakes, as well as apologize for them, but I can't help feeling that expecting Dropbox and others to get on their knees and kiss their users' toes when something happens is a little melodramatic. On the one hand, yes, they made a mistake - on the other, we all know that technology is flawed, and these things happen, albeit rarely.

TL;DR: Let's not make a drama out of it.


I'm admittedly being nit-picky because I feel very strongly about the importance of outage communication. Good communication both during and after an incident can make a tremendous amount of difference in how you are perceived.

They did decide it was worth apologizing for, but only near the end of the post. All I'm suggesting is that moving that up near the top and acknowledging up front that they let customers down would have improved the outcome.

They don't need to be over the top about it, just don't bury it at the end of the post.


Fair enough. Your comment was more of a spark of a sentiment I've been carrying around for a little while. I can't agree enough that proper outage communication is important.


Google provided a great Incident Report / Postmortem when they had their API infrastructure outage back in May 2013. I created a screencast about how their template should be used as a model for the rest of us to follow. You can watch the screencast @ http://sysadmincasts.com/episodes/20-how-to-write-an-inciden...


The thing I always look for in post-mortems is an understanding of the failure of human systems. The technical failures are interesting, but it is the human systems that produced them, and that will keep producing other failures unless they change.

I hasten to add that I'm not looking for finger-pointing or blame. In retrospectives, I think it's always best to assume that individuals did the best with what they had. [1] But I think it'd be great if Dropbox asked themselves things like "How did we miss this bug?" and "How could we have discovered this recovery issue before it was on the critical path for a public outage?" Questions like that help you solve not just this bug, but all the related latent bugs that you got the same way you got the one that just blew up.

[1] A lesson I learned from Norm Kerth: http://www.retrospectives.com/pages/retroPrimeDirective.html


"How did we miss this bug?" and "How could have we discovered this recovery issue before it was on the critical path for a public outage?"

I am pretty sure they would have done that - just that they did not include it in the post-mortem.


The first was answered in the post-mortem. The second is something that comes in time - either it's hard to answer in detail without revealing confidential information, or they are working towards it in the medium term.


Is this satire?


I thought Percona's XtraBackup already supported parallelized recovery using binary logs: http://www.percona.com/doc/percona-xtrabackup/2.1/ (though I'll admit I've never tried to do so).


This is the same problem that kept replication slaves single-threaded for so long:

Statement-based replication (the default) is tricky to apply in parallel, since you can't easily figure out ordering dependencies.

In MySQL 5.6, replication is now parallel per schema, and in MySQL 5.7 it will be parallel intra-schema.
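
For anyone who wants to try it, here's a minimal sketch from Python (assuming a MySQL 5.6+ replica and the mysql-connector-python package; the host and credentials are placeholders):

  import mysql.connector

  # The SQL thread must be stopped before slave_parallel_workers can be changed.
  conn = mysql.connector.connect(host="replica.example.com", user="admin", password="secret")
  cur = conn.cursor()
  cur.execute("STOP SLAVE SQL_THREAD")
  cur.execute("SET GLOBAL slave_parallel_workers = 4")   # per-schema parallelism in 5.6
  # MySQL 5.7+ only: intra-schema parallelism
  # cur.execute("SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK'")
  cur.execute("START SLAVE SQL_THREAD")
  cur.close()
  conn.close()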


They completely missed the real lessons from this outage: Automation is way more fallible than human beings, and they didn't follow basic best practices to stage and test production maintenance.

Humans have a lot going for them. They can think continuously and dynamically. They can change their instructions at a whim. They can provide custom solutions immediately. And they aren't limited to one way to solve a problem.

When you have to perform a bunch of complicated changes in bulk, you might think automating it would be the best way to ensure a uniform delivery of your changes. But when a single thing is different about one environment, everything is fucked. The only way to ensure a bunch of sensitive changes go off without a hitch is to make it a manual process, even if you have to supplement it with some automated processes along the way.

In this case, Dropbox allowed their site's reliability to be dictated by the automated maintenance of production servers. It's always dangerous to make changes on a production server. But what makes it worse here is that they relied on a script to make sure everything happened perfectly, and didn't double-check the results before putting it back into production.

They didn't even back up the old data in case they needed to quickly revert, which should be a basic requirement of any production change! This isn't even disaster recovery, this is production deployment 101. How they allowed this upgrade to affect the production site is just crazy to me.
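
The backup-before-change step alone is trivial to script - a rough sketch (the in_production flag and apply_upgrade.sh are placeholders, not Dropbox's actual tooling):

  import os, shutil, subprocess, sys
  from datetime import datetime

  def run_maintenance(host, data_dir, in_production, backup_root="/backups"):
      # Refuse to touch a host that is still in the production rotation.
      if in_production:
          sys.exit(f"Refusing to run: {host} is still serving production traffic")
      # Keep a copy of the old data so the change can be reverted quickly.
      stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
      shutil.copytree(data_dir, os.path.join(backup_root, f"{host}-{stamp}"))
      subprocess.run(["./apply_upgrade.sh", host], check=True)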


> The service was back up and running about three hours later, with core service fully restored by 4:40 PM PT on Sunday.

I understand these things happen but I didn't have anything working at all until Sunday EST. I'm just happy it's back.

Having said that, what do you use to back up your Dropbox? I recently signed up for Bitcasa, but my Dropbox folder had not been fully uploaded by the time Dropbox stopped working.


http://status.dropbox.com should really be better, and be the primary portal for updates. We use StatusPage.io, which is awesome: http://status.commando.io.


Couldn't upload files for more than a day. It was stuck at "connecting". I wish they had said on Twitter/the blog up front that they were still working on it.


About ten years ago, I had a minor epiphany while working on a system that used a standard client/service architecture. We were having "stability issues" with the back-end (trans: it was going down more often than a two-dollar hooker). If the back-end service didn't respond, the GUI just hung; there was no indication for the end user that there was a problem - it just hung until it timed out (and because of the nature of the system, the timeout was relatively lengthy).

This caused a certain amount of consternation amongst our users (because they were losing money as a result), who would far prefer to be told that the system wasn't working than to have to wait for a timeout. So I got the GUI developers to add in a little bit of logic along the lines of:

  # In Python terms (show_error_dialog/handle_response are placeholder names):
  if response.startswith("ERROR_MESSAGE"):
      # parse and display the ensuing text to the user
      show_error_dialog(response[len("ERROR_MESSAGE"):].strip())
  else:
      handle_response(response)  # continue as normal
We then knocked up a simple little service (using netcat, if I recall correctly) that squirted the contents of a text file back in response to any request, and echo >>'d the username supplied with the request into a text file. From then on, if the back-end service had a problem, we would redirect the load balancers to point to the mini-service so that any users would get a nice friendly error message (that we could edit/update) telling them that there was a problem.

When the outage was over and we'd confirmed everything was working properly again, we'd simply re-point the load-balancers back to the actual service IPs/ports, grab the file with all the usernames and email/call them to let them know that the service was back up. (We had planned to build more logic into the GUI and add an "Alert me when the system is back up" button to the error dialog, which would cause the GUI to automatically/silently re-try every X minutes and alert the user with a pop-up when full service was restored, but we sorted out the stability problems before we got a chance to implement that.)
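
These days the mini-service would only be a few lines of Python - a rough sketch (the file names, port, and request format are assumptions, not what we actually ran):

  import socketserver

  class OutageBanner(socketserver.StreamRequestHandler):
      def handle(self):
          # Log the username from the request so we can call people back later.
          username = self.rfile.readline().decode(errors="replace").strip()
          with open("usernames.txt", "a") as log:
              log.write(username + "\n")
          # Squirt back a canned banner, prefixed so the GUI displays it.
          with open("outage_message.txt", "rb") as msg:
              self.wfile.write(b"ERROR_MESSAGE " + msg.read())

  if __name__ == "__main__":
      with socketserver.TCPServer(("0.0.0.0", 9000), OutageBanner) as server:
          server.serve_forever()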

Maybe Dropbox should implement something similar so that, instead of being stuck at "connecting", users get an error message.


Worked throughout the weekend, avid Dropbox user, didn't know they had a problem up until now.


It is good that they shared it with everyone.


I completely disagree.


Why?


Agree - give the story. What was the command in the script that failed? To err is human; to blog honestly about it is a story I want to read. There is no room for fear in good content - show the true story.



