
'Check Your Backups Work' Day - ParadisoShlee
http://checkyourbackups.work
======
soheil
Imagine being the CEO of GitLab and seeing this right now. I think showing
compassion and solidarity may be a better response. Restoring from
non-existent backups should be treated as a more serious problem in our
industry. It happens too often, and not because the people who made the
backups were careless, but because catching errors when backups aren't working
can take an unreasonable amount of time, and even then they could still just
stop working one day.

We should probably treat this issue as something more like a disease such as
high blood pressure: you don't know you have it, but it is probably doing
irreparable damage to your internal organs. If we had no name for the disease
or understanding of it, we would just die at an earlier age without an obvious
cause.

Let's identify the diseases of this sort in our industry and work on their
prevention, diagnosis and treatment instead of just saying you should have
been more careful.

~~~
ParadisoShlee
I have nothing but respect for the GitLab team... offering live notes during
the recovery is stunning.

~~~
edelans
Agreed! Even when they screw things up, they make people benefit from it!
Kudos for such good spirit and dedication to your customers.

Everybody screws up from time to time in our industry. And when that happens,
there are two types of people: those who try to hide it, and those, like the
GitLab team, who communicate as fast as they can because they respect their
customers. Paradoxically, to me, it creates more trust than it damages.

~~~
sytse
Thanks for the kind words. I'm sorry for letting our users down. We'll ask the
5 whys
[https://en.wikipedia.org/wiki/5_Whys](https://en.wikipedia.org/wiki/5_Whys).
We need to go from the initial mistake (wrong machine, solved by better
hostname display and colors), to the second (not having a recent backup), to
the third (not testing backups), to the fourth (not having a script for backup
restores), to the fifth (nobody in charge of data durability and no written
plan). The solutions above are just guesses at this point; we'll dive into
this in the coming days and will communicate what we will do in a blog post.
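
Purely as an illustration of the "better hostname display and colors" idea -
not GitLab's actual fix - a bash prompt sketch like this makes production
shells hard to mistake (the hostname pattern is an assumption borrowed from
the incident doc):

    # ~/.bashrc on every server: make production shells visually unmissable.
    # The "*.cluster.gitlab.com" pattern is only an example, not the real naming scheme.
    if [[ "$(hostname -f)" == *".cluster.gitlab.com" ]]; then
        PS1='\[\e[41;97m\][PRODUCTION]\[\e[0m\] \u@\h:\w\$ '   # red banner
    else
        PS1='\[\e[42;30m\][staging]\[\e[0m\] \u@\h:\w\$ '      # green banner
    fi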

~~~
throwaway8124
Good morning (posted from a throwaway for reasons I'll describe).

I feel for you greatly here, and I commend your openness about how data
restoration caused 6 hours of data loss. I too work in a critical area where
even minutes of lost DB data is bad.

We just had our own test event recently. We make sure that we can fail
everything over and run on all secondaries. I found out how that worked: we
failed. The problem is that I found out after the fact. Due to the secrecy,
not even the teams knew why things failed the way they did. I had to piece it
together from disjointed hearsay, and now I believe I have a coherent picture.

So yes, when I read your post mortem and RCA, it reminded me greatly of what
happened here as well. But we can all learn from your example. As for me, I'm
posting this from a throwaway because of the likely threat to my job.

~~~
sytse
I agree that 6 hours is way too much.

------
borplk
Not referring to GitLab's incident specifically, but the general problem with
this stuff is that too often people treat something like "make a backup" too
literally.

As in, they think the act of generating a backup file is the last stage of the
process and they are done with it. Maybe they go the extra mile and throw it
in a crontab too.

What you have to do is to consider any and all backups non-existent until you
have a complete backup strategy.

In other words you have to appreciate that "having a backup" is a means to an
end and another way of saying "being able to successfully recover from data
loss".

So just generating a file is not sufficient.

You have to complete a successful test restore before you can call it a
backup.

You have to have heart-beat measures in place to make sure you will not be
impacted by a silent failure (for example: check that $last_successful_backup
is within the last X hours).
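
A minimal sketch of such a heart-beat check, assuming backups land in a single
directory and that cron (or your monitoring system) runs this and alerts on a
non-zero exit; paths and thresholds are examples only:

    #!/usr/bin/env bash
    # Alert if the newest file in the backup directory is older than MAX_AGE_HOURS.
    BACKUP_DIR=/var/backups/postgres
    MAX_AGE_HOURS=24

    latest=$(find "$BACKUP_DIR" -type f -printf '%T@ %p\n' | sort -n | tail -1 | cut -d' ' -f2-)
    if [ -z "$latest" ]; then
        echo "CRITICAL: no backups found in $BACKUP_DIR" >&2
        exit 2
    fi

    age_hours=$(( ( $(date +%s) - $(stat -c %Y "$latest") ) / 3600 ))
    if [ "$age_hours" -ge "$MAX_AGE_HOURS" ]; then
        echo "CRITICAL: newest backup ($latest) is ${age_hours}h old" >&2
        exit 2
    fi

    # A zero-byte "backup" is also a silent failure.
    if [ "$(stat -c %s "$latest")" -lt 1024 ]; then
        echo "CRITICAL: newest backup ($latest) is suspiciously small" >&2
        exit 2
    fi

    echo "OK: newest backup is ${age_hours}h old"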

You have to periodically manually check that your automated checks work
("simulate" backups not being generated and wait for alert).

Far too often people don't appreciate the depth and the weight that a phrase
like "backup plan" carries.

You ask them to "take care of backups", they go run pg_dump or mysqldump and
say "It's done". No goddamn it, it's not done.

~~~
bluGill
There are levels. Obviously the best setup is 3 backups on 3 different
continents (in the future one of the backups will be on Mars), with regular
tests that each site can restore from scratch.

Even the minimum level, an untested backup to the same disk, is better than
what most people have. That untested backup is there and protects against
accidental file deletion - which will eventually become the test for most
people. If the disk fails it is also something to work with. If you tell a
disaster recovery service that there is a backup, they can use that redundancy
to their advantage: odds are the physical damage isn't to both the backup and
the real data, and they only have to recover one. Even if the damage is to
both, there is something to work with.

As we move up the ladder there is more and more. An untested backup has data -
give the team a few years and we can recover it. We might have to recreate the
restore process from scratch, but there is something.

Remember though, the farther down the ladder you are, the more expensive
recovery can be. If you have tested backups on 3 continents, recovery only
costs a couple hours of downtime - an actuary can put an exact dollar cost on
this. If the price is too high you can invest in redundancy in the live
system. If you have an untested backup it might be years before your team can
recover it: millions of dollars in labor to recover the data and several years
of no/reduced business while they recover it. (Hint: the company will go out
of business because it cannot afford to recover the data.)

------
shubhamjain
The graver problem is that programmers - and humans in general - don't realise
the gravity of recklessness until the moment shit hits the fan. Startups begin
with a general imprudence towards checks and processes, which is
understandable, but even in the growing stages the idea of incorporating them
is snubbed because of "priorities". "Time is precious and there are more
important problems to solve along the way."

Unauthenticated MongoDB on the default port? The likelihood of someone
port-scanning the entire web just doesn't register as a real risk, and then,
someday, someone does exactly that. I guess this is one important reason to
bring at least some experienced people on board: there is a good chance they
can give a better perspective on the seriousness of such issues.

------
kator
In my day we used to develop "Disaster Recovery" programs. They were massive,
and we tested on a regular basis, including renting massive systems from IBM
and flying the team to the IBM data center to run a full restore of
everything. End business users had to log in afterwards and sign off.

I understand that "we live in a different world" is the favorite motto these
days. But do we really? If anything, data is bigger, more complex and not in
one place; you can't just ship a truckload of tapes and three people somewhere
to test.

IMHO the more we try to reinvent technology, the more we realize that some of
the things we all felt were weighing us down were actually smart ideas,
brought on not by fear but by real life experience.

And the pendulum will swing once more here, and back again at some point in
the future.

~~~
Joe8Bit
I don't recognise the 'different world' you describe. Practiced, disciplined
DR is a key part of modern software engineering at large and small companies
across the startup/enterprise spectrum (in my experience). The popularity of
tools like Netflix's Chaos Monkey/Gorilla/Kong serves as testament to that.

That's not to say all companies do it (it seems GitLab didn't), but the tone
of your comment doesn't reflect a lot of people's experiences.

~~~
acdha
I think the key is knowing that this is routine at a subset of shops while
most others are winging it to various degrees - and it always has been so.

If you worked at a responsible place decades ago you might be aghast at what
happens at a random sampling of companies now, but the same would have been
true at a random sampling decades ago. The difference is that unless you were
a customer or the outage was especially prominent you probably never would
have heard about it.

------
HarrisonFisk
This is every day at Facebook!

[https://code.facebook.com/posts/1007323976059780/continuous-...](https://code.facebook.com/posts/1007323976059780/continuous-mysql-backup-validation-restoring-backups/)

~~~
elvinyung
Oh wow, some interesting stuff in here. Looks like they use MySQL as a queue
for scheduling the ORC Peons? Would have loved to hear more about why they did
that.

~~~
hueving
Because it works perfectly fine? I tend to write off anyone who scoffs at
simple database-as-queue designs without understanding what the scaling
requirements are. You can use a database as a job queue for tens of thousands
of jobs per day without breaking a sweat.

~~~
OJFord
Scoffing? GP said it was "interesting stuff" and that they'd "love to hear
more". What's wrong with that?

------
virtualized
> After a second or two he notices he ran it on db1.cluster.gitlab.com,
> instead of db2.cluster.gitlab.com

Everyone is talking about backups, but why not about this? How is it even
possible to delete the production database by accident? Why does he have SSH
access there? Why do they test their database replication in production? Why
are they fire-fighting by changing random database settings _in production_?

I know that all of this is common practice. I am questioning it in general.

~~~
nadaviv
I would start by giving them more descriptive names... calling them db1 and
db2 is a sure way to trip someone up some day.

~~~
abricot
This is already on their to-do list in the same document.

------
ktta
I think having one day a year is a bit sparse. Some startups start and shut
down within a year. Apart from checking backups after any big code change
related to backups, I think backups should be checked quarterly.

It takes no more than a couple of hours most of the time, and as the wise say,
"an ounce of prevention is worth a pound of cure".

~~~
paulddraper
I think once a quarter is better, but if your startup shuts down at the end of
the year, you probably don't need to worry about it.

~~~
ktta
I don't think any startup knows that they're going to shutdown within the
year. No one would take the time if they knew they were going to shut down
soon.

~~~
paulddraper
Correct. So save some concerns (e.g. weekly verification of backups) until
after a year.

------
thejosh
March 31st...
[http://www.worldbackupday.com/en/](http://www.worldbackupday.com/en/)

~~~
jlgaddis
Heh, "Don't be an April Fool" reminded me of the old annual "Internet Spring
Cleaning" [0] ritual.

[0]:
[http://www.snopes.com/holidays/aprilfools/cleaning.asp](http://www.snopes.com/holidays/aprilfools/cleaning.asp)

~~~
jeron
I wonder if anyone would fall for it now...

~~~
jk563
Along the same lines as [https://xkcd.com/1053/](https://xkcd.com/1053/), I
believe so.

------
yeukhon
One time at work I accidentally triggered a delete on an RDS CloudFormation
stack. It was not fun. The automatic backups from AWS were useless, because
automatic snapshots are removed as soon as the RDS instance is removed, unless
you tell AWS to make a final snapshot. We didn't have that flag in the stack
template at the time, so ugh.
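
For anyone wondering what that looks like at the API level: the final snapshot
is opt-in when an RDS instance is deleted. A hedged sketch with made-up
instance names (in a CloudFormation template the equivalent is a deletion
policy on the DB resource):

    # Dangerous: the automated snapshots disappear along with the instance.
    aws rds delete-db-instance --db-instance-identifier mydb --skip-final-snapshot

    # Safer: force a final snapshot to be taken as part of the delete.
    aws rds delete-db-instance --db-instance-identifier mydb \
        --final-db-snapshot-identifier mydb-final-before-delete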

Oh, how did I delete the stack? I was using the mobile app to look at the
status of the CFN stack, but the app was laggy and my finger pressed the wrong
button... sigh. The other interesting thing was that I was checking the status
because the previous night I had changed my RDS instance to provisioned IOPS
(which took 8 hours and failed too). I felt sad and guilty, but at the same
time I felt whatever, because the upgrade didn't go through, so perhaps this
accident was all meant to be....

~~~
aaronmdjones
Ouch.

Doubly ouch that there apparently isn't a confirmation dialog with a 5-second
countdown before you can hit Yes, or whatever.

~~~
PhantomGremlin
For critical stuff IMO there needs to be more than just a confirmation dialog.
The user needs to be transitioned into a totally different state of mind from
the usual click, click, click, click.

E.g. forcing someone to manually type the characters D E L E T E before
allowing deletion of something potentially important.

Either that or everything should have multiple levels of undo. Everything.
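
A minimal sketch of that "type it out" pattern for a destructive script (the
script and database names are placeholders; dropdb is the standard PostgreSQL
client tool):

    #!/usr/bin/env bash
    DB_NAME="$1"
    echo "You are about to DROP database '$DB_NAME' on $(hostname)."
    read -r -p "Type the database name to confirm: " confirm
    if [ "$confirm" != "$DB_NAME" ]; then
        echo "Confirmation did not match; aborting." >&2
        exit 1
    fi
    sleep 5   # the 5-second pause suggested upthread, to break the click-click reflex
    dropdb "$DB_NAME"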

~~~
andrewaylett
There's certainly some stuff in AWS that requires you to type the name of the
thing you're deleting into a text field in order to delete it, but I suspect
that's purely a UI check -- there wouldn't be so much point in requiring the
name twice in the underlying API.

So if you've got a dud client implementation then you're going to lose the
check.

One way to do stuff like this is to have separate roles for read-only and
read-write access. I pay a lot more attention to what I'm doing on the rare
occasions I assume permissions to change things.
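
On the separate-roles point, one common shape for this (the role name and
account ID are invented for the example) is to live in read-only credentials
and explicitly assume a writer role only for the dangerous bits:

    # Normal sessions use read-only credentials; switching roles is a deliberate act.
    # The returned temporary credentials are exported into the environment for the
    # few commands that actually need write access, then discarded.
    aws sts assume-role \
        --role-arn arn:aws:iam::123456789012:role/infra-admin \
        --role-session-name deleting-old-stack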

------
jacquesm
Suggested workflow:

Make a backup, restore to test environment, run checksums, anonymize, release
test environment.

That way each and every backup is tested both for integrity and ability to
rebuild a working environment from it.
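
A rough sketch of that loop for a Postgres-backed app (database and table
names, paths, and the sanity/anonymisation queries are placeholders, not a
prescription):

    #!/usr/bin/env bash
    set -euo pipefail

    # 1. Make the backup.
    pg_dump -Fc myapp > /backups/myapp-$(date +%F).dump

    # 2. Restore it into a throwaway test database.
    createdb myapp_restore_test
    pg_restore -d myapp_restore_test /backups/myapp-$(date +%F).dump

    # 3. Run checksums / sanity queries against the restored copy.
    psql -d myapp_restore_test -c "SELECT count(*) FROM users;"   # expect a plausible number

    # 4. Anonymise, then hand the environment to whoever needs realistic test data.
    psql -d myapp_restore_test -c "UPDATE users SET email = 'user' || id || '@example.com';"

    # 5. Tear it down (or keep it as the released test environment).
    dropdb myapp_restore_test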

In my practice, insufficient backups are still (unfortunately) a very common
occurrence.

On another note: just having stuff stored with triple replication in the cloud
is emphatically _NOT_ a backup.

And it also helps if the same people that have access to the live environment
do not have write access to the backups, but that's only feasible past a
certain team size.

------
xtracto
Ooohh, something similar happened at our company some time ago:

We had a MongoDB server (with a read replica) in our production environment
(this was an EC2 instance running MongoDB).

One day, a dev accidentally deleted the main collection in the DB during a
night coding session. The next morning, when we realized it, we went straight
to the daily backups we had been making. It turned out that, for some reason,
the backups of the previous 2 or 3 days had not worked.

We had to get into the MongoDB oplog (which was enabled only because of the
read replica) and reconstruct the missing 3 days from it.

That was fucking scary.

------
muse900
[http://www.commitstrip.com/en/2016/09/05/do-we-have-a-back-u...](http://www.commitstrip.com/en/2016/09/05/do-we-have-a-back-up-in-the-audience/)

^ this

------
ssivark
While we're discussing the importance of backups, I would like to pause for a
minute to think about a _common systemic failure mode that simply making
backups doesn't solve_.

I've realized that in day-to-day situations it is more likely that data is
lost because of momentary programmer carelessness. Examples: deleting the
backup version of a folder, deleting the copy on the wrong server, etc. This
seems similar to what happened at GitLab.

How do we protect ourselves from this failure mode? Can we design a better
system than one that _assumes_ every human command is well considered? (Sort
of like guard rails when purging backups.)

~~~
phireal
I've tended to make backups read only so as to minimise the impact of
accidental deletions.
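
For file-level backups on Linux that can be as blunt as the following (paths
are examples; chattr needs root and an ext-family filesystem):

    # Drop write permission once the backup has completed...
    chmod -R a-w /backups/2017-02-01
    # ...or go further and mark it immutable, so even root has to flip the flag first.
    chattr -R +i /backups/2017-02-01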

------
onetom
It's exemplary to be so honest.

The CEO should laugh with us a bit and be proud to have inspired a day being
named after their hiccup, thanks to their transparency.

I'm totally compassionate, but we should never lose our sense of humour!

------
arca_vorago
I would like to take this moment to urge all of you who are concerned about
your business backups to make sure you are enabling your systems administrator
with the budget, tools, personnel, and backing from management he/she needs to
get the job done.

I have seen far too often one-man miracle teams swimming in technical debt,
constantly solving problems but never having the time to play the political
games needed to push for the kinds of changes they need. Obviously small
operations are different, but for example, I've seen a ~250-person, 6-branch
business with 1 senior and 1 junior part-time sysadmin, whose requests for
budget and personnel were constantly denied; he said his backup system worked,
but he knew it wasn't as good as he could make it. He eventually quit in
frustration. He was a great sysadmin but didn't play enough politics, and
therefore he failed and his management failed him, all the while jeopardizing
the business. Please don't do this to your sysadmin.

CTOs and CIOs, please take a moment to ask your sysadmin what things they need
that they haven't been able to convince you of yet, and see if you can
compromise or otherwise try to lend their arguments importance.

In all but the leanest of SV web startup land, sysadmins are the backbone that
keeps your company running. Don't neglect or forget them.

If you do, one day that backup may fail, or a crypto-ransomware variant will
hit the server, and although you will scapegoat your sysadmin, it will have
truly been your fault.

------
rabboRubble
It's one thing to schedule a backup.

Quite another to _test_ the backup's restorability.

Most of us back up. Very few test that the backup works as intended. I need to
do better on the latter.

~~~
overcast
Or you know, just confirming the backups actually ran!?

------
saycheese
Really wish "backups" as a term would be replaced with a term that means the
data was copied and proofen to fully accessible. Any ideas what that term
might be and why?

~~~
saycheese
Recoverables?

------
ParadisoShlee
I feel a special need to congratulate and offer a tip of the hat to the GitLab
team for their transparency during this outage. Excellent work!

------
ThatGeoGuy
This actually inspired me to set up my daily / monthly backups, which I had
not done since upgrading to Devuan almost half a year ago. Fortunately, I had
already written a blog post [1] about backups, so setting myself back up and
adding a cron task took under 30 min with a fresh disk.

[1] [https://thatgeoguy.ca/blog/2013/12/26/encrypted-backups-in-d...](https://thatgeoguy.ca/blog/2013/12/26/encrypted-backups-in-debian/)

------
benmorris
I discovered my dev server DB backups weren't working after reading this. I
used to consider my dev server just a place where I toss test code to mimic
production, but I had forgotten that the DB running on it was the only copy of
potentially months of work! I now have it dump nightly to my desktop, which
also gets backed up locally and to the cloud.

------
dbg31415
Too bad "Check your site is using HTTPS" day, or "Check your website meta data
is setup for sharing" day, or "Check your site is legible on mobile" day
weren't first.

Look, shit happens... we don't need to make fun of people for it. We all cut
corners at times... when we are lucky, nobody notices. When we aren't...

------
niceworkbuddy
Here is a link to the document mentioned:
[https://docs.google.com/document/d/1GCK53YDcBWQveod9kfzW-VCx...](https://docs.google.com/document/d/1GCK53YDcBWQveod9kfzW-VCxIABGiryG7_z_6jHdVik/pub)

------
buzzybee
Would extend this to "keep your personal data backed up too" - and for
low-value or low-security data, it's easily done with consumer commercial
services, since they're designed to let you be lazy and have no backup
discipline.

------
ttflee
Mean-Time-Between-RM-RF.

------
samdoidge
> So in other words, out of 5 backup/replication techniques deployed, none are
> working reliably or set up in the first place.

------
dom0
With something like Borg, where you can just mount your backups and look at
them normally, it's fairly easy to see whether they're OK / include what you
wanted.

Of course, backing up a whole platform is more complex, and things like
databases normally require custom scripting (dump -> back up the dump, e.g.
pg_dumpall | borg create ... -).
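
A hedged sketch of that pattern (the repo path and archive names are invented;
Borg reads the dump from stdin when the path is "-", and a mounted archive can
then be inspected like any directory):

    # Dump the whole cluster straight into a new Borg archive (stored as a file named "stdin").
    pg_dumpall | borg create /srv/borg-repo::pg-$(date +%F) -

    # Later: mount the archive and eyeball / checksum what actually got backed up.
    mkdir -p /mnt/borg
    borg mount /srv/borg-repo::pg-2017-02-01 /mnt/borg
    ls -lh /mnt/borg
    borg umount /mnt/borg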

------
gtsteve
> Our backups to S3 apparently don't work either: the bucket is empty

Ouch. I can't even imagine how that feels. This is why, even with monitoring
and paging scripts, I still have a weekly calendar event to check my company's
backups. Now I don't feel so paranoid.

------
majkinetor
GitLab's behavior is a testament to success!

No hiding, no euphemisms; their live doc stream actually made me question what
I've done on my own systems. Looks like convergent evolution in some places,
like the prompt changes.

Thanks, GitLab.

~~~
filipa
From someone who works at GitLab, thank you for your kind words! Our
infrastructure team is working hard!

------
boie0025
This has always been my worst fear. Solidarity to the people at gitlab dealing
with this no doubt incredibly stressful situation. <3

------
timcederman
When stuff "just works", you don't need to check your backups. I fully trust
my iPhone's iCloud backups, my Time Machine backups, and my cloud rsyncs. Time
Machine also lets me know if they get corrupted, or if I haven't backed up in
a while. That's how backups should work - an adage of "you don't have backups
unless you check them" just won't work for most people.

~~~
misternugget
You might want to check your Time Machine backups by hand from time to time
using the tmutil [1] tool.

My Time Machine backups have in the past been missing gigabytes of data,
without Time Machine telling me anything about it. And not just volatile data
like temp files or caches, but photos, music and documents.

[1]: [http://osxdaily.com/2012/01/21/compare-time-machine-backups-...](http://osxdaily.com/2012/01/21/compare-time-machine-backups-tmutil/)
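
A quick way to do that spot check from the terminal (a summarised comparison
of the latest snapshot against the live disk; both commands are standard
macOS tmutil verbs):

    # List the snapshots Time Machine knows about, then diff the newest one
    # against the live filesystem (-s summarises by size instead of listing every file).
    tmutil listbackups
    tmutil compare -s "$(tmutil latestbackup)"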

~~~
overcast
I use Time Machine for quick recovery stuff, and then Arq Backup to send to
remote storage.

------
koolba
"Check your backups work day" is like "Earth day" \- it should be every day.

------
davvolun
Aaaand I just found out my personal backups at home haven't been running for
about a month.

ALWAYS CHECK!

------
chiefalchemist
Well, I guess that explains my MIA repos earlier today.

------
dogma1138
I check my backups on Feb 29th ;)

------
elvinyung
Typo in the page: should be @gitlab, not @gitlib.

~~~
ParadisoShlee
PR was pulled. Thanks.

------
muse900
Can't someone working for the company I bank with accidentally delete my debt
and its backups? :P

------
Zelmor
That gave me the chuckles. I'll put it in the calendar. Should be a real
thing.

------
korzun
It's nice to see that people still use Microsoft Word for web design.

------
dhimes
Man, I thought I was having a bad day. If you got spam from
$fakeUserName@studyswami.com, you have my profound apologies. Now I have to
figure out how to mea culpa to Gmail (and everyone else) for the "why the
hate?" protests. Ugh.

