GitLab Database Incident Report (docs.google.com)
This is painful to read. It's easy to say that they should have tested their backups better, and so on, but there is another lesson here, one that's far more important and easily missed.

When doing something really critical (such as playing with the master database late at night) ALWAYS work with a checklist. Write down WHAT you are going to do, and if possible, talk to a coworker about it so you can vocalize the steps. If there is no coworker, talk to your rubber ducky or the stapler on your desk. This will help you catch mistakes. Then when the entire plan looks sensible, go through the steps one by one. Don't deviate from the plan. Don't get distracted and start switching between terminal windows. While making the checklist ask yourself if what you're doing is A) absolutely necessary and B) risks making things worse. Even when the angry emails are piling up, you can't allow that pressure to cloud your judgment.

Every startup has moments when last-minute panic-patching of a critical part of the server infrastructure is needed, but if you use a checklist you're not likely to mess up badly, even when tired.

Yup, it's never the fault of a person, always of the system. Once we get this resolved we'll definitely look at ways to prevent anything like it in the future.

23:00-ish

YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com

Good lesson on the risks of working on a live production system late at night when you're tired and/or frustrated.

Also, as a safety net, sometimes you don't need to run `rm -rf` (a command which should always be prefaced with 5 minutes of contemplation on a production system). In this case, `rmdir` would have been much safer, as it errors on non-empty directories.

Or use `mv x x.bak` when `rmdir` fails
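For what it's worth, here is a minimal sketch of the `rmdir`-then-`mv` idea as a shell function. `safe_remove` and the `.bak.<timestamp>` suffix are names made up for illustration, not anything from the incident doc.

```shell
# Hypothetical helper: remove a directory only if it is empty,
# otherwise move it aside instead of deleting anything.
safe_remove() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "safe_remove: not a directory: $dir" >&2
    return 1
  fi
  # rmdir refuses to delete a non-empty directory.
  if rmdir -- "$dir" 2>/dev/null; then
    echo "removed empty directory: $dir"
    return 0
  fi
  # Fall back to a rename so the data is still recoverable.
  backup="${dir%/}.bak.$(date +%Y%m%d%H%M%S)"
  mv -- "$dir" "$backup" && echo "not empty, moved to: $backup"
}
```

You would call it as `safe_remove ./some_dir` and clean up the `.bak.*` copies later, once you are sure nothing needed them.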

These days, I've been very explicit in how I run rm. To the extent that I don't do rm -rf or rmdir (edit: immediately), but run it as separate steps, something like:

  pushd dir ; find . -type f -ls | less ; find . -type f -exec rm '{}' \; ; popd ; rm -rf dir

It takes a lot longer to do, but I've seen and made enough mistakes over the years that the forced extra time spent feels necessary. It's worked pretty well so far -- knock knock.

Good lesson on making command prompts on machines always tell you exactly what machine you're working on.

I like to color code my terminal. Production systems are always red. Dev are blue/green. Staging is yellow.

All of my non-production machines have emojis in PS1 somewhere. It sounds ridiculous, but I know that if I see a cheeseburger or a burrito I'm not about to completely mess everything up. Silly terminal = silly data that I can obliterate.

I've been color-coding my PS1 for years, but this is seriously brilliant, thanks!

In this case it looks like the confusion was between two different replicated production databases, so this would not have helped.

I use iterm2's "badging" to set a large text badge on the terminal of the name of the system as part of my SSH-into-ec2-systems alias:

    i2-badge ()
    {
      # iTerm2 expects the badge text base64-encoded; quote it so it stays one argument
      printf "\e]1337;SetBadgeFormat=%s\a" "$(printf '%s' "$1" | base64)"
    }

It's not quite as good as having a separate terminal theme, but then I haven't been able to use that feature properly. :(

Yep, good idea. The same thing has been suggested by team members http://imgur.com/a/TPt7O

It's a really good idea, and is one of the improvements that are likely to be put in place as soon as possible. It's already listed in the document.

I do this too, but in this case both machines were production, so this alone would not have sufficed. The system-default prompts on the other hand are universally garbage.
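Since colors alone can't tell two production boxes apart, one option is to make the destructive step ask for the hostname. A rough sketch, assuming a made-up wrapper called `confirm_host` (it is not something GitLab uses as far as the doc says):

```shell
# Hypothetical guard: run a command only after the operator retypes the
# hostname of the machine they believe they are on.
confirm_host() {
  actual="$(uname -n)"
  printf 'About to run on %s: %s\n' "$actual" "$*" >&2
  printf 'Type the hostname to continue: ' >&2
  read -r typed || typed=""
  if [ "$typed" != "$actual" ]; then
    echo "hostname mismatch, refusing to run" >&2
    return 1
  fi
  # Only reached when the typed name matches this machine.
  "$@"
}
```

Used as `confirm_host rmdir ./data` (the path is just an example). Typing the name of the machine you think you are on fails immediately when you are actually on the other one.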

How do you go about colour coding your terminal?

I assume he color coded the prompt. You can use ANSI color escape codes in there to e.g. color your hostname.

Here's a generator for Bash: http://bashrcgenerator.com/, the prompt's format string is stored in the $PS1 variable.
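If you'd rather write it by hand, here is a small sketch for bash. The role names, the color choices, and the `SERVER_ROLE` variable are assumptions for illustration; set them however your machines are labeled.

```shell
# Hypothetical mapping from an environment label to an ANSI color code.
prompt_color_for() {
  case "$1" in
    production) echo 31 ;;  # red
    staging)    echo 33 ;;  # yellow
    *)          echo 32 ;;  # green for dev and anything else
  esac
}

# Build a bash prompt string. \[ and \] mark non-printing sequences for bash;
# \u, \h and \w expand to user, short hostname, and working directory.
build_ps1() {
  color="$(prompt_color_for "$1")"
  printf '\\[\\e[1;%sm\\]\\u@\\h\\[\\e[0m\\]:\\w\\$ ' "$color"
}

# In ~/.bashrc, with SERVER_ROLE being an assumed per-machine variable:
# PS1="$(build_ps1 "${SERVER_ROLE:-dev}")"
```

The hostname stays in the colored part of the prompt, so it is the first thing you see before typing anything.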

How exactly do you color code it?

This doesn't really help if there are multiple production databases. It could be sharded, replicated, multi-tenant, etc.

Why would it matter? In my last job we had user home directories synced via puppet (I am oversimplifying this), which enabled any ops guy to have the same set of shell and vim configuration settings on production machines too.

I daresay having the hostname as part of the prompt saves a lot of trouble.

Also a good lesson for testing your availability and disaster recovery measures for effectiveness.

Far, far too many companies get production going and then just check to see that certain things "completed successfully" or didn't throw an overt alert in terms of their safety nets.

Just because things seem to be working doesn't mean they are or that they are working in a way that is recoverable.

Seems like very basic mistakes were made, not at the event but way long before. If you don't test to restore your backups, you don't have a backup. How does it go unnoticed that S3 backups don't work for so long?
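A cheap first line of defense is a monitoring check that alerts when no recent, non-empty backup file exists. This is only a sketch: `check_backup_fresh` is a made-up name, it assumes backups land as files in a local directory, and it relies on `find -mmin`, which GNU and BSD find support but POSIX does not require. It does not replace an actual restore test.

```shell
# Hypothetical check: succeed only if the directory holds at least one
# non-empty file modified within the last max_age_min minutes.
check_backup_fresh() {
  dir="$1"
  max_age_min="$2"
  recent="$(find "$dir" -type f -mmin "-$max_age_min" -size +0 2>/dev/null | head -n 1)"
  if [ -z "$recent" ]; then
    echo "ALERT: no non-empty backup newer than ${max_age_min} min in $dir" >&2
    return 1
  fi
  echo "fresh backup found: $recent"
}
```

Run it from cron or your monitoring system, for example `check_backup_fresh ./backups 1500` for a daily job with some slack, and page someone on a non-zero exit.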

Yeah, the "You don't have backups unless you can restore them" rule strikes again.

Virtually the only way to lose data is to not have backups. We live in such fancy times that there's no reason to ever lose data that you care about.

> How does it go unnoticed that S3 backups don't work for so long?

My uneducated guess (this one hit a friend of mine): expired/revoked AWS credentials combined with a backup script that doesn't exit(1) on failure and just writes the exception trace to stderr.
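To make that failure mode concrete, here is a minimal sketch of a wrapper that refuses to report success when the dump command errors or produces nothing. `run_backup` is a hypothetical name, and this is a guess at the general pattern, not a reconstruction of GitLab's script.

```shell
# Hypothetical wrapper: run a dump command, write its output to a file,
# and fail loudly if the command errors or the result is empty.
run_backup() {
  out="$1"
  shift
  if ! "$@" > "$out"; then
    echo "backup FAILED: command exited non-zero: $*" >&2
    return 1
  fi
  # -s is true only when the file exists and has a size greater than zero.
  if [ ! -s "$out" ]; then
    echo "backup FAILED: $out is empty" >&2
    return 1
  fi
  echo "backup ok: $out"
}
```

In a cron job that would look like `run_backup ./dump.sql pg_dump mydb || exit 1` (file and database names are placeholders), so the job's exit status reflects what actually happened.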

> Our backups to S3 apparently don’t work either: the bucket is empty

followed by

> So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.

is no way to be running a public service with paying customers. Did the person who set up that S3 job simply write a script or something and just go "yep, it's done" and walk away? Seriously?

Amazingly transparent and honest.

Unfortunately, this kind of situation, "only the ideal case ever worked at all", is not uncommon. I've seen it before ... when doing things the right way, dotting 'I's and crossing 'T's, requires an experienced employee a good week or two, it's very tempting for a lean startup to bang out something that seems to work in a couple days and move on.

Backups have sucked on our instances of GLE starting in 8.15, because someone decided to add a "readable" date stamp in addition to the Unix timestamp in the backup file name without proper testing, which caused many issues. It was somewhat fixed, but I do still see issues in 8.16.

I'm not complaining, but backup/restore is an important part and should have 100% test coverage and daily backup/restore runs.

Start at:

> At this point frustration begins to kick in. Earlier this night YP explicitly mentioned he was going to sign off as it was getting late (23:00 or so local time), but didn’t due to the replication problems popping up all of a sudden.

Not sure if the doc here is refreshing or scary. But Godspeed GitLab team. I've loved the product for about two years now, so curious to see how this plays out.

It's both.

I very much appreciate their forthrightness and the way they conduct their company generally. Having said that, I have the code I work on, related content, and a number of clients on the service.

[edit for additional point]

They need the infrastructure guy they've been looking for sooner rather than later. I hope there's good progress on that front.

We've hired some great new people recently but as you can see there is still a lot of work to do. https://about.gitlab.com/jobs/production-engineer/

"So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place."

Does this mean whatever was in that database is gone, with no available backups?

Is this an SOA where important data might lie in another service or data store, or is this a monolithic app and DB that is responsible for many (or all) things?

What was stored in that database? Does this affect user data? Code?

The doc says that there is an LVM snapshot that is 6 hours old. <strike>And there should be a regular logical backup with at most 24 hours age as well (they just can't find it for whatever reason).</strike> (Scratch that, my doc did not update, despite Google saying it should automatically update).

Regarding what's gone: the production PostgreSQL database. This suggests that the code itself is fine, but the mappings to the users are gone. But git is a distributed VCS after all, so all the code should be on the developers' machines as well.

We have snapshots, but they're not very recent (see the document for more info). The most recent snapshot is roughly 6 hours old (relative to the data loss). The data loss only affects database data; Git repositories and Wikis still exist (though they are fairly useless without a corresponding project).

Best of luck with the recovery! I know this must be stressful. :(


If you haven't tested your backups, you don't have backups.

Are 'YP' the initials of an employee or is this an acronym I don't know?

Yes, those are the initials of an employee here. Sorry for the confusion!

As much as I appreciate GitLab's extreme openness, that's maybe something that by policy shouldn't be part of published reports. Internal process is one thing, but if something goes really bad, customers might not be so good at "blameless postmortems" if they have a name to blame.

That is why we went with initials. And I hope customers understand the blame is with all of us, starting with me. Not with the person in the arena. https://twitter.com/sytses/status/826598260831842308

Is your username a Spin reference?

Haha, it wasn't intentional. I'm just a space nerd. That book ranks pretty highly on my list of things every space nerd should read though.

I quite enjoyed it! Also, +1 for space. And Greek.

I think it's a staff member. Can't remember first name, Yuri maybe, who is fairly active with the project.

Nope, that would be me.

Tough night dude. I'll buy you a drink or three if you're ever in Sydney...

Alas, poor Yorick!

Sorry for the rough night Yorick. This could happen to all of us but of course it happens to the person that is working the hardest. <3

Thanks for the transparency. Doesn't always feel good to have missteps aired in public, but it makes us all a little better as a community to be clear about where mistakes can be made.

I'll pour one out for you next time I go out.

From looking at the context of the way YP is referenced (link to a slack archive), I believe YP is an employee.

If you haven't done so recently, TEST YOUR BACKUPS.

This is the stuff my nightmares consist of after 900 consecutive days of being on call (and counting).

Are you a one man team or...? My wife would probably leave me if I was on-call for that long.

I noticed the issue when I was pushing code earlier today. Hopefully this gets resolved soon. You guys are doing a great job. Keep up the good work!

Thanks, not feeling great about the job we're doing today, but we'll learn from this.

And we're sorry for the inconvenience this caused to your workflow today!

