
GitLab accidentally deleted production data - aeharding
https://twitter.com/gitlabstatus/status/826591961444384768
======
mongoosled
They've released a doc of the recovery effort:
[https://docs.google.com/document/d/1GCK53YDcBWQveod9kfzW-VCx...](https://docs.google.com/document/d/1GCK53YDcBWQveod9kfzW-VCxIABGiryG7_z_6jHdVik/pub)

This is the best communicated outage that I can remember seeing.

------
tmnvix
2017/01/31 23:00-ish:

a) YP thinks that perhaps pg_basebackup is being super pedantic about there
being an empty data directory, decides to remove the directory. After a second
or two he notices he ran it on db1.cluster.gitlab.com, instead of
db2.cluster.gitlab.com

b) YP terminates the removal, but it’s too late. Of around 300 GB only about
4.5 GB is left

Nightmare material. Are there any known cases of DBAs or developers suffering
a heart attack due to this sort of thing?

------
robalfonso
Backups every 24 hours? I would have expected hourly. This is an experience
thing: when doing things with data (like wiping db directories), I mv a
directory rather than rm it. If you are working on a problem for hours, you are
likely to make a simple critical mistake like that, which then creates a whole
new problem to solve.
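
A rough sketch of the mv-instead-of-rm habit, in Python for illustration. The function name `move_aside` and the `.bak-<timestamp>` naming are my own assumptions, not anything from the GitLab write-up:

```python
import shutil
import time
from pathlib import Path


def move_aside(path: str) -> Path:
    """Rename a directory out of the way instead of deleting it.

    Returns the new location so it can be removed later, once the
    operation that needed an empty directory has clearly succeeded.
    """
    src = Path(path)
    if not src.is_dir():
        raise NotADirectoryError(f"{src} is not a directory")
    # Timestamped sibling name (hypothetical naming scheme).
    dest = src.with_name(f"{src.name}.bak-{time.strftime('%Y%m%d-%H%M%S')}")
    if dest.exists():
        raise FileExistsError(f"{dest} already exists")
    shutil.move(str(src), str(dest))
    # Recreate the empty directory the tool was asking for.
    src.mkdir()
    return dest
```

If it turns out you ran this on the wrong host, undoing it is a rename back rather than a restore from backup. The cost is needing enough free disk to keep the old copy around for a while.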

Best of luck though and kudos for the transparency!

~~~
tqkxzugoaupvwqr
I’m surprised they never checked whether their backups even exist. “Oops, the
S3 bucket is empty!”.
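
Even a very small check would catch the empty-bucket case. A minimal sketch, assuming the backups land as files in a local or mounted directory (for S3 you would list the bucket instead); `check_backups` and its thresholds are hypothetical:

```python
import time
from pathlib import Path


def check_backups(backup_dir: str, max_age_hours: float = 24.0,
                  min_bytes: int = 1) -> Path:
    """Return the newest backup file, or raise if it is missing, stale, or too small."""
    files = [p for p in Path(backup_dir).iterdir() if p.is_file()]
    if not files:
        raise RuntimeError(f"no backups found in {backup_dir}")
    newest = max(files, key=lambda p: p.stat().st_mtime)
    info = newest.stat()
    age_hours = (time.time() - info.st_mtime) / 3600
    if age_hours > max_age_hours:
        raise RuntimeError(f"newest backup {newest} is {age_hours:.1f} hours old")
    if info.st_size < min_bytes:
        raise RuntimeError(f"newest backup {newest} is only {info.st_size} bytes")
    return newest
```

Run from cron or a monitoring job, so that an exception turns into an alert. It only proves a recent, non-empty file exists, not that it can be restored.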

------
BrailleHunting
Statistically since ye olden times, losing data is an LD50 event within 6
months.

Always test your backups automatically and fail the backup job if the restore
fails, because DR/BCP is no good without viable customer data.

------
coupdejarnac
Bad timing... I need to push some code to a client.

------
aeharding
Best of luck to the team in recovery!

