What if you had finished up a document at home, thought it synced but it never did, and then showed up at a client's a few hours later wondering why you still had the old version on your phone/laptop?
Wouldn't you have wanted Dropbox to let you know something was going on?
This is such a hard thing to balance. I hate nagging notifications. Dropbox makes it really easy to see what the service status is. I glance at my menu bar/system tray icon, and look for the green check. If I don't see the green check, I know my docs haven't synced.
Green check - good to go!
Blue cycle - syncing
Blank icon - no connection
Not wanting a nag is not rose-tinted glasses. If there is an improvement to be made, it would be in the last icon. Blank doesn't exactly scream "we're down". If the Dropbox client can't get a connection to the service, but it can see that a network link is available, it should give some indication that it is not connected, such as a red indicator of some kind.
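For what it's worth, telling "no network at all" apart from "network is up but the service is unreachable" is a cheap check. A rough sketch in Python (sync.example.com is a made-up stand-in for whatever endpoint the client actually talks to):

    import socket

    def can_connect(host, port, timeout=3):
        # True if a TCP connection to host:port succeeds within the timeout.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def tray_icon_state():
        service_up = can_connect("sync.example.com", 443)  # hypothetical sync endpoint
        network_up = can_connect("8.8.8.8", 53)            # any well-known reachable host
        if service_up:
            return "green"   # good to go (or blue while actively syncing)
        if network_up:
            return "red"     # link is fine, service is not - "we're down"
        return "blank"       # no connection at all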
Glad I read HN otherwise I don't know how I would have come across this information. =)
I thought their response struck the appropriate level of detail. I don't care to know the inner workings of their processes, but I'd like some indication that they care and that they're working on it. I got that from this.
A few of the things that jumped out at me after one reading:
1. The apology is the next to last sentence. That's burying the lede. I'd like to see that far earlier, in the first two to three sentences.
2. The tone is overly clinical and lacks humanity. I suspect they felt that it made them sound more authoritative and in control, but instead it comes off somewhat robotic.
3. There's a mixture of too little and too much technical detail. It feels like they couldn't decide who the audience was. There were technical tidbits thrown out without any elaboration that led to more questions than answers.
4. The remediations sound pretty weak. There's no discussion of the human factors like how the recovery process went, how this issue was missed in testing, or what changes if any they think they should make to their incident response process. At the very least I'd expect to see some remediation around their during-outage communication process since it has pretty universally been considered to be poor.
It's not the worst post-mortem I've read, but they missed a few chances to reassure customers.
I will think no less of any company with solid technology that experiences a failure and puts an honest effort into communicating a post mortem explanation, which is exactly what happened here.
I will, though, lose some respect for people who quibble about perceived faux pas in the explanation, because that loses sight of what is actually important.
Not every company is into that whiney startup blood and tears thing. Those "we(*) worked non-stop for the last 72 hours" lines often sound a bit desperate.
(*) And by "we", the PR people usually mean the engineers.
I was more wondering how the mechanics of their incident response processes were managed and whether they planned to make any changes as a result of the review of this incident. Technical remediations are all well and good, but organizational, cultural, and even procedural changes are often even more impactful after events like this.
Were they happy with the pace of communication during the outage? Do they think customers were updated frequently enough, too frequently, etc. Any changes planned?
How did they handle incident fatigue? Did they have to go to shifts to manage the recovery? Did they already have this planned or was it done on the fly? Do they plan to build any procedures to handle similar long-running events in the future?
If I wanted a credit, or SLAs weren't met, then I'd talk directly to an account manager.
What would be an AWESOME idea is if Dropbox did a meetup to go through the gory details for us nerds. Now that would rock.
Kudos to the Dropbox team for working through the weekend fixing stuff. I spent the better part of the weekend nursing a barely-two-year-old, dying Apple 27" Cinema Display back to life by disassembling it several times. Kept thinking to myself that I sure as hell was glad it wasn't me over at Dropbox HQ doing recovery work instead.
Edit: I agree with your plea for emotion in the post. It could ease things a bit.
Having written more than my fair share of these, I definitely understand the difficulty involved in choosing your audience and writing to them. That's a big part of the problem here: the audience is not clear. The post dips into technical detail like the MySQL recovery process, but it doesn't go deep enough to satisfy a really technical audience while still being too detailed for a non-technical one.
I have nothing but admiration for their team and the service they've built, but this post-mortem misses the mark.
Bingo. We need nerd updates.
BTW, we deserve this because enough of us use Dropbox for quite important things coding-wise.
Now, I'm not saying that I don't expect companies to be forthright and take ownership of their mistakes, as well as apologize for them, but I can't help but feeling that expecting Dropbox and others to get on their knees and kiss their users' toes when something happens is a little melodramatic. On the one hand, yes, they made a mistake - on the other, we all know that technology is flawed, and these things happen, albeit rarely.
TL;DR: Let's not make a drama out of it.
They decided that it was worth apologizing for near the end of the post. All I'm suggesting is that moving that up near the top and acknowledging up front that they let customers down would have improved the outcome.
They don't need to be over the top about it, just don't bury it at the end of the post.
I hasten to add that I'm not looking for finger-pointing or blame. In retrospectives, I think it's always best to assume that individuals did the best with what they had. But I think it'd be great if Dropbox asked themselves things like "How did we miss this bug?" and "How could we have discovered this recovery issue before it was on the critical path for a public outage?" Questions like that help you solve not just this bug, but all the related latent bugs that you got the same way you got the one that just blew up.
 A lesson I learned from Norm Kerth: http://www.retrospectives.com/pages/retroPrimeDirective.html
I am pretty sure they would have done that - just that they did not include it in the post mortem.
Statement based replication (default) is tricky to apply in parallel, since you can't easily figure out ordering dependencies.
In MySQL 5.6 replication is now parallel per-schema, and in MySQL 5.7 it will be parallel intra-schema.
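For anyone who wants to flip that on, the relevant replica settings look roughly like this (a sketch assuming a stock setup; the 5.7 option is what has been announced, since 5.7 isn't out yet):

    [mysqld]
    # MySQL 5.6: apply replicated events in parallel, one worker per schema
    slave_parallel_workers = 4

    # MySQL 5.7 (announced): allow parallelism within a single schema
    # slave_parallel_type = LOGICAL_CLOCK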
Humans have a lot going for them. They can think continuously and dynamically. They can change their instructions at a whim. They can provide custom solutions immediately. And they aren't limited to one way to solve a problem.
When you have to perform a bunch of complicated changes in bulk, you might think automating it would be the best way to ensure a uniform delivery of your changes. But when a single thing is different about one environment, everything is fucked. The only way to ensure a bunch of sensitive changes go off without a hitch is to make it a manual process, even if you have to supplement it with some automated processes along the way.
In this case, Dropbox allowed their site's reliability to be dictated by the automated maintenance of production servers. It's always dangerous to make changes on a production server. But what makes it worse here is that they relied on a script to make sure everything happened perfectly, and didn't double-check the results before putting it back into production.
They didn't even back up the old data in case they needed to quickly revert, which should be a basic requirement of any production change! This isn't even disaster recovery, this is production deployment 101. How they allowed this upgrade to affect the production site is just crazy to me.
I understand these things happen but I didn't have anything working at all until Sunday EST. I'm just happy it's back.
Having said that, what do you use to back up your Dropbox? I recently signed up for Bitcasa but my Dropbox folder had not been fully uploaded by the time Dropbox stopped working.
This caused a certain amount of consternation amongst our users (because they were losing money as a result), who would far prefer to be told that the system wasn't working than to have to wait for a timeout. So I got the GUI developers to add in a little bit of logic along the lines of:
    if the response received from the back-end service starts with the string "ERROR_MESSAGE"
    then parse and display the ensuing text to the user
    else continue as normal
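In real code that check is only a handful of lines. A rough Python equivalent (show_error_dialog and handle_response are placeholders for whatever the GUI already does):

    def show_error_dialog(message):
        # Placeholder: pop the message in whatever dialog widget the GUI uses.
        print("SERVICE NOTICE:", message)

    def handle_response(response):
        # Placeholder: the normal response-handling path.
        print("OK:", response)

    def on_response(response):
        if response.startswith("ERROR_MESSAGE"):
            # Parse the ensuing text and display it to the user
            show_error_dialog(response[len("ERROR_MESSAGE"):].strip())
        else:
            handle_response(response)  # continue as normal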
When the outage was over and we'd confirmed everything was working properly again, we'd simply re-point the load-balancers back to the actual service IPs/ports, grab the file with all the usernames and email/call them to let them know that the service was back up. (We had planned to build more logic into the GUI and add an "Alert me when the system is back up" button to the error dialog, which would cause the GUI to automatically/silently re-try every X minutes and alert the user with a pop-up when full service was restored, but we sorted out the stability problems before we got a chance to implement that.)
Maybe Dropbox should implement something similar so that, instead of being stuck at "connecting", users get an error message.