
GitLab Database Incident – Live Report - sbuttgereit
https://docs.google.com/document/d/1GCK53YDcBWQveod9kfzW-VCxIABGiryG7_z_6jHdVik/pub
======
CrLf
The public report is nice and we can see a sequence of mishaps from it that
shouldn't have been allowed to happen but which (unfortunately) are not that
uncommon. I've made my share of mistakes, and I know what it's like to be in
emergency mode and too tired to think straight, so I'm going to refrain from
criticizing individual actions.

What I'm going to criticize is the excess of transparency:

You absolutely DO NOT publish postmortems referencing actions by NAMED
individuals, EVER.

From reading the whole report it's clear that the group is at fault, not a
single individual. But most people won't read the whole thing, even less
people will try to understand the whole picture. That's why failures are
always attributed publicly to the whole team, and actions by individuals are
handled internally only.

And they're making it even worse by livestreaming the thing! It's like having
your boss looking over your shoulder but a million times worse...

~~~
YorickPeterse
I myself initially added my name to the document in various parts, this was
later changed to just initials. I specifically told my colleagues it was OK to
keep it in the document. I have no problems taking responsibility for
mistakes, and making sure they don't happen ever again.

~~~
jawilson2
That's awesome, but why publicize it? This isn't an act of contrition for you;
no one outside your team really needs to see your dirty laundry, and it
actually comes off as unprofessional to me. The gitlab team is a team, and you
take responsibility as a team. Placing names and initials in the liveblog makes
it look like SOMEONE is trying to assign and pass off blame, even if that is
not what is happening.

Presumably in the coming days there will be a number of team meetings where
you discover what went wrong, and what the action items are for everyone
moving forward. The public looking info just needs to say what went wrong, how
it is being fixed, and what will be done in the future to prevent it from
happening again. I don't need names to get that.

~~~
omouse
On the contrary, it comes off as _very_ professional. Most other companies
would hide this; they would show off a very cleaned-up post-mortem and say
"problem solved" and that's it. OK, so what does that mean? Does it mean the
process will change for the future, or that they just fixed it for today?

This is also an awesome advert to see how they work remotely all together and
I'm sure they're hiring for DevOps people now ;)

~~~
throwaway91111
Naming individuals is not professional. Even allowing it with permission does
not set a good standard for operation.

------
DanielDent
I'm a huge Gitlab fan. But I long ago lost faith in their ability to run a
production service at scale.

Nothing important of mine is allowed to live exclusively on Gitlab.com.

It seems like they are just growing too fast for their level of investment in
their production environment.

One of the only reasons I was comfortable using Gitlab.com in the first place
was because I knew I could migrate off it without too much disruption if I
needed to (yay open source!). Which I ended up forced to do on short notice
when their CI system became unusable for people who use their own runners
(overloaded system + an architecture which uses a database as a queue. ouch.).

That migration put an end to what had seemed like constant performance issues.
It was overdue, and it let me sleep well about things like backups :).

A while back one of their database clusters went into split brain mode, which
I could tell as an outsider pretty quickly... but for those on the inside, it
took them a while before they figured it out. My tweet on the subject ended up
helping document when the problem had started.

If they are going to continue offering Gitlab.com I think they need to
seriously invest in their talent. Even with highly skilled folks doing things
efficiently, at some point you just need more people to keep up with all the
things that need to be done. I know it's a hard skillset to recruit for - us
devopish types are both quite costly and quite rare - but I think operating
the service as they do today seriously tarnishes the Gitlab brand.

I don't like writing things like this because I know it can be hard to
hear/demoralizing. But it's genuine feedback that, taken in the kind spirit in
which it's intended, will hopefully be helpful to the Gitlab team.

~~~
throwaway3347
Like you, I would like to add my 2 cents, which I hope will be taken
positively, as I would like to see them provide healthy competition for GitHub
for years to come.

Since GitLab is so transparent about everything (marketing, sales, feature
proposals, technical issues, etc.), they make it glaringly obvious, from time
to time, that they lack some very fundamental core skills needed to do things
right/well. In my opinion, they really need to focus on recruiting top talent
with domain expertise.

They (GitLab) need to convince those who would work for Microsoft or GitHub
to work for GitLab. With their current hiring strategy, they are getting
capable employees, but they are not getting employees who can help solidify
their place online (gitlab.com) and in Enterprise. The fact that they were so
nonchalant about running bare metal, and talked about implementing features
they have no basic understanding of, clearly shows the need for better
technical guidance.

They really should focus on creating jobs that pay $200,000+ a year,
regardless of living location, to attract the best talent from around the
world. Getting 3-6 top people who can help steer the company in the right
direction can make all the difference in the long run.

GitLab right now is building a great company to help address low-hanging-fruit
problems, but not a team that can truly compete with GitHub, Atlassian,
and Microsoft in the long run. Once the low-hanging-fruit problems have been
addressed, people are going to expect more from Git hosting, and this is where
Atlassian, GitHub, Microsoft and others that have top talent/domain expertise
will have the advantage.

Let this setback be a vicious reminder that you truly get what you pay for and
that it's not too late to build a better team for the future.

~~~
hueving
Why would they try to recruit from Microsoft? Most of the software engineers
at Microsoft are not focused on developing scalable web services
architectures. And the ones that do have built up all of their expertise with
Microsoft technologies (.net running on Windows server talking to mssql).

>Microsoft and others that have top talent/domain expertise, will have the
advantage.

Again, Microsoft isn't even in this same field (git hosting) or if they are,
are effectively irrelevant due to little market/mindshare. Are you an employee
there or something?

~~~
rrdelaney
One of the main drivers of revenue for Microsoft is Office 365, with 23.1
million subscribers[0]. Along with Azure, MS runs some of the largest web
services around. Most developers at MS don't necessarily work on these
products, but to say that all the devs working on them use a simple .NET stack
+ SQL Server is discrediting a lot of work that they do.

Disclaimer: I work for Microsoft in the Office division and opinions are my
own

[0] [https://www.microsoft.com/en-
us/Investor/earnings/FY-2016-Q4...](https://www.microsoft.com/en-
us/Investor/earnings/FY-2016-Q4/press-release-webcast)

~~~
Zelmor
>I work for Microsoft in the Office division

Hey there, honest question incoming. Any chance of you chaps making Word a
better documentation tool in the future? Edit history storing formatting and
data changes on the same tree makes it impossible to use Word for anything
serious. This really comes to light once you start working on documentation at
an MS tech company, where you're obviously expected to use MS products for the
job. Some tech writers I know just end up using separate technology branches
for their group efforts, since neither Sharepoint nor Word is a professional
tool for this job.

------
elementalest
>1\. LVM snapshots are by default only taken once every 24 hours. YP happened
to run one manually about 6 hours prior to the outage

>2\. Regular backups seem to also only be taken once per 24 hours, though YP
has not yet been able to figure out where they are stored. According to JN
these don’t appear to be working, producing files only a few bytes in size.

>3\. SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries
are being run instead of 9.6 binaries. This happens because omnibus only uses
Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not
exist. As a result it defaults to 9.2, failing silently. No SQL dumps were
made as a result. Fog gem may have cleaned out older backups.

>4\. Disk snapshots in Azure are enabled for the NFS server, but not for the
DB servers. The synchronisation process removes webhooks once it has
synchronised data to staging. Unless we can pull these from a regular backup
from the past 24 hours they will be lost The replication procedure is super
fragile, prone to error, relies on a handful of random shell scripts, and is
badly documented

>5\. Our backups to S3 apparently don’t work either: the bucket is empty

>So in other words, out of 5 backup/replication techniques deployed none are
working reliably or set up in the first place.

Sounds like it was only a matter of time before something like this happened.
How could so many systems not be working without anyone noticing?

~~~
gizmo
What if I told you all of society is held together by duct tape? If you're
surprised that startups cut corners you're in for a rude awakening. I'm
frequently amazed anything works at all.

~~~
hardwaresofton
The real question is what holds together duct tape?

~~~
devdas
Duct tape is like the force. It has a light side and a dark side, and it holds
the universe together.

------
js2
If you're a sys admin long enough, it will eventually happen to you that
you'll execute a destructive command on the wrong machine. I'm fortunate that
it happened to me very early in my career, and I made two changes in how I
work at the suggestion of a wiser SA.

1) Before executing a destructive command, pause. Take your hands off the
keyboard and perform a mental check that you're executing the right command on
the right machine. I was explicitly told to literally sit on my hands while
doing this check, and for a long time I did so. Now I just remove my hands
from the keyboard and lower them to my side while re-considering my action.

2) Make your production shells visually distinct. I set up staging machine
shells with a yellow prompt and production shells with a red prompt, with the
full hostname in the prompt. You can also color your terminal window
background. Or use a routine such as: production terminal windows are always on
the right of the screen. Close/hide all windows that aren't relevant to the
production task at hand. It should always be obvious what machine you're
executing a command on and especially whether it is production. (edit: I see
this is in the outage remediation steps.)
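
For anyone who wants to copy the idea, here's a minimal sketch of the prompt
trick (the hostname patterns below are purely illustrative; match whatever your
own production and staging boxes are called):

      # ~/.bashrc on every box: red prompt on production, yellow on staging
      case "$(hostname -f)" in
        *.prod.example.com)    PS1='\[\e[1;41m\][PROD]\[\e[0m\] \u@\H:\w\$ ' ;;
        *.staging.example.com) PS1='\[\e[1;33m\]\u@\H:\w\$ \[\e[0m\]' ;;
        *)                     PS1='\u@\H:\w\$ ' ;;
      esac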

One last thing: I try never to run 'rm -rf /some/dir' straight out. I'll
almost always rename the directory and create a new directory. I don't remove
the old directory till I confirm everything is working as expected. Really,
'rm -rf' should trigger red-alerts in your brain, especially if a glob is
involved, no matter if you're running it in production or anywhere else.
DANGER WILL ROBINSON plays in my brain every time.
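
A rough sketch of that rename-first routine, with made-up paths:

      # park the directory instead of deleting it outright
      mv /some/dir /some/dir.retired.$(date +%Y%m%d)
      mkdir /some/dir
      # ...confirm everything still works, then (days later, if you like):
      rm -rf /some/dir.retired.20170131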

Lastly, I'm sorry for your loss. I've been there, it sucks.

~~~
js2
Here's the .bashrc/.bash_login I use:

[https://gist.github.com/jaysoffian/8c75e661f7a61b0d094703e26...](https://gist.github.com/jaysoffian/8c75e661f7a61b0d094703e265d8d5b4)

------
ams6110
_23:00-ish

YP thinks that perhaps pg_basebackup is being super pedantic about there being
an empty data directory, decides to remove the directory. After a second or
two he notices he ran it on db1.cluster.gitlab.com, instead of
db2.cluster.gitlab.com_

Good lesson on the risks of working on a live production system late at night
when you're tired and/or frustrated.

~~~
theptip
Also, as a safety net, sometimes you don't need to run `rm -rf` (a command
which should always be prefaced with 5 minutes of contemplation on a
production system). In this case, `rmdir` would have been much safer, as it
errors on non-empty directories.
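
To illustrate with a throwaway directory (the exact error wording varies by
platform; this is roughly what GNU coreutils prints):

      $ mkdir scratch && touch scratch/important.db
      $ rmdir scratch
      rmdir: failed to remove 'scratch': Directory not empty
      $ rm -rf scratch    # no such safety net: everything is gone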

~~~
bpchaps
These days, I've been very deliberate in how I run rm, to the extent that I
don't do rm -rf or rmdir (edit: immediately), but in separate lines, something
like:

    
    
      pushd dir
      find . -type f -ls | less
      find . -type f -exec rm '{}' \;
      popd
      rm -rf dir
    

It takes a lot longer to do, but I've seen and made enough mistakes over the
years that the forced extra time spent feels necessary. It's worked pretty
well so far -- knock knock.

~~~
kstrauser
BTW,

    
    
      find ... -delete
    

avoids any potential shell escaping weirdness and saves you a fork() per file.

~~~
wodny
This seems to be the best approach here. As a side note: if someone does
something more complicated and pipes find output to xargs, there are very
important arguments to find and xargs to delimit names with a binary zero --
-print0 and -0 respectively.
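
For example, a delete pipeline that survives spaces and newlines in filenames
(paths invented):

      # NUL-delimited on both ends, so odd filenames can't split or be misparsed
      find /var/tmp/scratch -type f -name '*.tmp' -print0 | xargs -0 rm --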

Very interesting article: [https://www.dwheeler.com/essays/fixing-unix-linux-
filenames....](https://www.dwheeler.com/essays/fixing-unix-linux-
filenames.html).

~~~
willemmali
I've been writing an `sh`-based tool to check up on my local Git repos, and it
uses \0-delimited paths and a lot of `find -print0` + `xargs -0`:

[https://gitlab.com/willemmali-
sh/chegit/blob/master/chegit#L...](https://gitlab.com/willemmali-
sh/chegit/blob/master/chegit#L703)

I admit the code can look a little weird, but that's because I had some rather
tight constraints: 1 file, all filenames `\0`-separated internally, and just
POSIX `sh`. I still wanted to reuse code and properly quote variables inside
`xargs` invocations (because `sh` does not support `\0`-separated reads), so
I ended up having to basically paste function definitions into strings and use
some fairly expansive quotation sequences.

~~~
bpchaps
Nice plug for gitlab ;).

\0 is an insanely useful separator for this sort of thing and yeah, it
definitely gets messy. I'm working on a similar project that uses clojure/chef
to read proc files in a way that causes as little overhead as possible. \0
makes life so much easier when used. The best example I can think of off the
top of my head is something similar to:

    
    
      bash -c "export FOO=1 ; export BAR=2 && cat /proc/self/environ | tr '\0' '\n' | egrep 'FOO|BAR'"
      FOO=1
      BAR=2

~~~
willemmali
I was so freaked out at the news, I normally have local backups of my projects
but I just happened to be in the middle of a migration where my code was just
on Gitlab, and then they went down... Luckily it all turned out OK.

\0 is very useful but I really wish for an updated POSIX sh standard with
first-class \0 support.

On your code, why do you replace \0's with newlines? egrep has the -z flag
which makes it accept \0-separated input. A potential downside to it is that
it automatically also enables the -Z flag (output with \0 separator).

I solved the "caller might use messy newline-separated data"-problem by having
an off-by-default flag that makes all input and output \0-separated; this is
handled with a function called 'arguments_or_stdin' (which does conversion to
the internal \0-separated streams) and 'output_list' (which outputs a list
either \0- or \n-separated depending on the flag).

------
nanch
I remember when we accidentally deleted our customers' data. That was the
worst feeling I ever had running our business. It was about 4% of our entire
storage set, and we had to let our customers know and start the resyncs. Those
first 12 hours of panic were physically and emotionally debilitating - more
than they have the right to be. I learned an important lesson that day:
business is business and personal is personal. I remember it like it was
yesterday, the moment I consciously decided I would no longer allow business
operations to determine my physical health (stress level, heart rate, sleep
schedule).

For what it's worth, it was a lesson worth learning despite what seemed like
catastrophic world-ending circumstances.

We survived, and GitLab will too. GitLab has been an extraordinary service
since the beginning. Even if their repos were to get wiped (which seems not to
be the case), I'd still continue supporting them (after I re-up'd from my
local repos). I appreciate their transparency and hope that they can turn this
situation into a positive lesson in the long run.

Best of luck to GitLab sysops and don't forget to get some sleep and relax.

------
totally
I had a great manager a little while back who said they had an expression in
Spain:

> "The person who washes the dishes is the one who breaks them."

Not, like, all the time. But sometimes. If you don't have one of these under
your belt, you might ask yourself if you're moving too slow.

If that didn't help, he would also point out:

> "This is not a hospital."

Whatever the crisis, and there were some good ones, we weren't going to save
anyone's life by running around.

Sure, data loss sucks, but nobody died today because of this.

I really appreciate the raw timeline. I feel your pain. Get some sleep.
Tomorrow is a new day.

~~~
drewmate
> Get some sleep.

Definitely get sleep, but it would be nice if the site were back online before
that. I actually just created a new GitLab account and project a couple days
ago for a project I needed to work on with a collaborator tonight. This is not
a good first impression.

~~~
sidlls
Paid or unpaid account and project?

------
niftich
I applaud their forthrightness and hope that it's recoverable so that most of
the disaster is averted.

To me the most illuminating lesson is that debugging 'weird' issues is enough
of a minefield; doing it in production is fraught with even more peril.
Perhaps we as users (or developers with our 'user' hat on) expect so much
availability as to cause companies to prioritize it so high, but (casually,
without really being on the hook for any business impact) I'd say availability
is nice to have, while durability is mandatory. To me, an emergency outage
would've been preferable to give the system time to catch up or recover, with
the added bonus of also kicking off the offending user causing spurious load.

My other observation is that troubleshooting -- the entire workflow -- is
inevitably pure garbage. We engineer systems to work well -- these days often
with elaborate instrumentation to spin up containers of managed services and
whatnot, but once they no longer work well we have to dip down to the lowest
adminable levels, tune obscure flags, restart processes to see if it's any
better, muck about with temp files, and use shell commands that were designed
40 years ago for a different time. This is a terrible state of affairs. I
don't have an easy solution for the 'unknown unknowns', but the collective
state of 'what to do if this application is fucking up' feels like it's in the
stone ages compared to what we've accomplished on the side of when things are
actually working.

~~~
chatmasta
Be careful not to overlook the benefits of instrumentation even in the
"unknown unknowns" scenario. If you implement it properly, the instruments
will alert you to _where_ the problem is, saving you time from debugging in
the wrong place.

The initial goal of instrumentation should be to provide sufficient cover to a
broad area of failure scenarios (database, network, CPU, etc), so that in the
event of a failure, you immediately know where to look. Then, once those broad
areas are covered, move onto more fine-grained instrumentation, preferably
prioritized by failure rates and previous experience. A bug should never be
undetectable a second time.

As a contrived example, it was "instrumentation," albeit crudely targeted,
that alerted GitLab that the problem was with the database. This
instrumentation only pointed them to the general area of the problem, but of
course that's a necessary first step. Now that they've had this problem, they
can improve their database-specific instrumentation and catch the error faster
next time.

------
Walkman
Seems like very basic mistakes were made, _not_ at the event but way long
before. If you don't test to restore your backups, you don't have a backup.
How does it go unnoticed that S3 backups don't work for so long?

~~~
ocdtrekkie
Helpful hint: Have an employee who regularly accidentally deletes folders. I
have a couple; it's why I know my backups work. :D

~~~
connorshea
Even better, have a Chaos Monkey do it ;)

~~~
ocdtrekkie
Would you believe I have enough chaos already?

------
jlengrand
Gotta love the tweet though : "We accidentally deleted production data and
might have to restore from backup."

[https://status.gitlab.com/](https://status.gitlab.com/)

As usual, I really love the transparency they are showing in how they are
taking care of the issue. Lots to learn from

------
jamesmiller5
As I read the report I notice a lot of PostgreSQL "backup" systems depend on
snapshotting from the FS & Rsync. This may work for database write logs, but
it certainly will _corrupt_ live git repositories that use local file system
locking guarantees. NFS also requires special attention (a symlink lock) as
writes can be acknowledged concurrently for byte offsets unless NFSv4 locking
& compatible storage software is used.

Git repo corruption from snapshotting tech (tarball, zfs, rsync, etc):
[http://web.archive.org/web/20130326122719/http://jefferai.or...](http://web.archive.org/web/20130326122719/http://jefferai.org/2013/03/24/screw-
the-mirrors/)

Prev. Hacker News submission:
[https://news.ycombinator.com/item?id=5431409](https://news.ycombinator.com/item?id=5431409)

Gitlab, I know you are all under pressure atm but when the storm passes feel
free to reach out to my HN handle at jmiller5.com and I'd be happy to let you
know if any of your repository backup solutions are dangerous/prone to
corruption.

~~~
koolba
I see LVM[1] mentioned in the notes. It allows you to, among other things,
snapshot a filesystem atomically which you could then mount read-only to a
separate location to read for backups or export to a different environment.
That would give you a point in time view of the state of all the repos that
should be as consistent as a "stop the world then backup" approach.
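
Roughly what that looks like in practice; the volume group and LV names here
are invented, and the snapshot needs enough reserved space to absorb writes
made while the backup runs:

      # point-in-time snapshot of the repo volume
      lvcreate --size 10G --snapshot --name repos-snap /dev/vg0/repos
      # mount it read-only elsewhere and back up from the frozen view
      mount -o ro /dev/vg0/repos-snap /mnt/repos-snap
      rsync -a /mnt/repos-snap/ backup-host:/backups/repos/
      # tear it down when finished
      umount /mnt/repos-snap
      lvremove -f /dev/vg0/repos-snap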

[1]:
[https://en.wikipedia.org/wiki/Logical_volume_management](https://en.wikipedia.org/wiki/Logical_volume_management)

~~~
rleigh
LVM snapshots the raw block device (logical volume). The filesystem is layered
on top of that, and then open and partially written files on top of that. So
snapshotting an active database is really not the best idea; it might work, it
should work, but it'll need to discard any dirty state from the WAL when you
restart it with the snapshot. You might be in for more trouble with other data
and applications, depending upon their requirement for consistency.

It's definitely not as consistent as "stop the world then backup" because the
filesystem is dirty, and the database is dirty. It's equivalent to yanking the
power cord from the back of the system, then running fsck, then replaying all
the uncommitted transactions from the WAL.

It's for this reason that I use ZFS for snapshotting. It guarantees filesystem
consistency and data consistency at a given point in time. It'll still need to
deal with replaying the WAL, but you don't need to worry about the filesystem
being unmountable (it does happen), and you don't need to worry about the
snapshot becoming unreadable (once the snapshot LV runs out of space). LVM was
neat in the early 2000s, but there are much better solutions today.

~~~
koolba
> LVM snapshots the raw block device (logical volume). The filesystem is
> layered on top of that, and then open and partially written files on top of
> that. So snapshotting an active database is really not the best idea; it
> might work, it should work, but it'll need to discard any dirty state from
> the WAL when you restart it with the snapshot. You might be in for more
> trouble with other data and applications, depending upon their requirement
> for consistency.

> It's definitely not as consistent as "stop the world then backup" because
> the filesystem is dirty, and the database is dirty. It's equivalent to
> yanking the power cord from the back of the system, then running fsck, then
> replaying all the uncommitted transactions from the WAL.

I was referring to using LVM to snapshot the filesystem where the git repos
are hosted. It'd work for a database as well, assuming your database correctly
uses fsync/fdatasync, and for git specifically it works fine.

Using LVM snapshots with a journaled filesystem (i.e. any modern/sane choice
for a fs) should have no issues though there would be some journal replay at
mount time to get things consistent (vs. say ZFS, which wouldn't require it).
If it does have issues, you'd have the same issues with the raw device in the
event of hard shutdown (ex: power failure).

------
ploxiln
Amazingly transparent and honest.

Unfortunately, this kind of situation, "only the ideal case ever worked at
all", is not uncommon. I've seen it before ... when doing things the right
way, dotting 'I's and crossing 'T's, requires an experienced employee a good
week or two, it's very tempting for a lean startup to bang out something that
seems to work in a couple days and move on.

------
elliottcarlson
Regarding making mistakes:

Tom Watson Jr., CEO of IBM between 1956 and 1971, was a key figure in the
information revolution. Watson repeatedly demonstrated his abilities as a
leader, never more so than in our first short story.

A young executive had made some bad decisions that cost the company several
million dollars. He was summoned to Watson’s office, fully expecting to be
dismissed. As he entered the office, the young executive said, “I suppose
after that set of mistakes you will want to fire me.” Watson was said to have
replied,

“Not at all, young man, we have just spent a couple of million dollars
educating you.”

------
aubreykilian
This happened to me late one night a few years back, with Oracle on a CentOS
server. rm -rf /data/oradata/ on the wrong machine.

I managed to get the data back though, as Oracle was still running and had the
files open. "lsof | grep '(deleted)'" and /proc/<ORACLEPIDHERE>/fd/* saved my
life. I managed to stop all connections to the database, copy all the
(deleted) files into a temp directory, stop Oracle, copy the files to their
rightful place, and start up Oracle, with no data lost.
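
For anyone who hasn't seen this trick before, the rough shape of it (the PID
and paths are placeholders):

      # list files the process still holds open but which have been unlinked
      lsof -p <ORACLEPIDHERE> | grep '(deleted)'
      # each one is still reachable through its file descriptor; copy it out
      cp /proc/<ORACLEPIDHERE>/fd/27 /recovery/system01.dbf   # fd number from the lsof output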

------
vhost-
I haven't seen or done anything of this scale before, but I did have a very
sobering moment while working on a large online retailer's stack as a systems
engineer.
engineer.

We were rolling out a new stack in another data center across the country and
before replication went live, I decided to connect and check things out. Our
chef work hadn't completed for the database hosts, so I decided to install
some OS updates by hand using pssh on all the MySQL hosts and saw a kernel
update. So I thought, the DC isn't live yet, no replication is running, I'll
just restart these servers. So I did, using pssh again, and then I caught a
glimpse of the domain in some output and my face went completely pale. I had
restarted the production databases... all of them. And they all had 256GB of
ECC memory. It takes a very long time for each of those machines to POST.

I contacted the client and said the maintenance page was my fault and was
fully expecting to be fired on the spot, but they just grilled me about being
careful in the future, and then laughed it off.

I've been the most careful ever since then. It scared me straight. Always make
sure you are in the right environment before you do anything that requires a
write operation.

~~~
olig15
Exactly the right response. You're not going to make that same mistake again,
but if you were fired, your replacement very well might.

------
CameronBanga
Not sure if the doc here is refreshing or scary. But Godspeed GitLab team.
I've loved the product for about two years now, so curious to see how this
plays out.

~~~
sbuttgereit
It's both.

I very much appreciate their forthrightness and the way they conduct their
company generally. Having said that, I have the code I work on, related
content, and a number of clients on the service.

[edit for additional point]

They need the infrastructure guy they've been looking for sooner rather than
later. I hope there's good progress on that front.

~~~
sytse
We've hired some great new people recently but as you can see there is still a
lot of work to do. [https://about.gitlab.com/jobs/production-
engineer/](https://about.gitlab.com/jobs/production-engineer/)

~~~
mrmondo
I've just DM'd you on twitter with some PostgreSQL advice ;)

~~~
mrmondo
Note for the person that downvoted my comment: GitLab are fantastic, and I'm a
big advocate for their product & support. My comment to Sid was an offer to
help based on some of the notes I saw in their very transparent report. Sid &
I have talked several times in the past and I have quite a bit of PostgreSQL
experience, so my comment was a positive one offering support when / if
needed, not a negative one / piss take, if it came off as such.

------
susi22
I was a Sysadmin for a university for a while. During those years I also
messed up once. I learnt one thing: If you run a command that cannot be
undone, you pause, take a few seconds and think. Is it the right directory? Is
it the right server (yes this has happened to me too). Are those the right
parameters? Can I simulate the command beforehand?

I introduced this double check for myself, and it has actually caught a few
commands I was about to run.

------
sciurus
This is why you have _lots_ of copies of your data. Matt Ranney has a great
talk about designing for failure that includes details on Uber's "worst outage
ever." It too involved postgresql replication and mistaking one host for
another, but they didn't lose any data because they had more than a dozen live
copies of their database.

[https://www.youtube.com/watch?v=bNeZYVIfskc&t=26m54s](https://www.youtube.com/watch?v=bNeZYVIfskc&t=26m54s)

This isn't an alternative to working backups, of course, but it is an
additional safety net. Plus it can give you a lot more options when handling
an incident.

~~~
yellowbeard
More copies / replicas also means a larger attack surface.

------
travisby
Amazing document. Thank you for sharing. Taking it back to my company to make
sure we can learn from it and to know what to check (like our logical
backups... I know we've seen issues with our 9.5 servers and RHEL7 defaulting
to 9.1 or 9.2 on our host where we take the backups from! Verifying exit code
here we come...)

@sytse, I noticed you _do_ use streaming WAL replication, but I didn't notice
any mention of attempting Point In Time Recovery. Have you looked into
archiving the WAL files in S3? Those, along with frequent pg_basebackups
(frequent because replaying a WAL file has been painfully slow for us), could
let you recover to a point in time: either a timestamp or a transaction (and
before or after it). [https://www.postgresql.org/docs/9.6/static/continuous-
archiv...](https://www.postgresql.org/docs/9.6/static/continuous-
archiving.html)

We use [https://github.com/wal-e/wal-e](https://github.com/wal-e/wal-e) to
manage our uploading to swift (no S3 at our company heh) and then inhouse
tooling to build a recovery.conf. Note we actually have our asynchronous
followers work off of this system too so they're not taking bandwidth from the
primary.

(note this can lead to ~ 1 WAL file of data loss, but that is acceptable for
us.)
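
For readers who haven't set this up before, a rough sketch of the moving parts
with wal-e; the paths, credentials directory and target timestamp are
invented, so check the wal-e README and the PostgreSQL docs linked above for
the real details:

      # postgresql.conf on the primary: ship every completed WAL segment off-box
      #   wal_level = replica
      #   archive_mode = on
      #   archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'
      # plus a periodic base backup so recovery doesn't replay weeks of WAL:
      envdir /etc/wal-e.d/env wal-e backup-push /var/opt/gitlab/postgresql/data
      # recovery.conf on the restore target: replay WAL up to just before the mistake
      #   restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
      #   recovery_target_time = '2017-01-31 22:55:00'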

I doubt I could be of any help, since reading the report definitely shows
y'all having an up on me with pg knowledge, but if there's anything I can do /
talk about feel free to reach out.

~~~
kogepathic
I know it's not wal-e, but Barman recently added support for streaming WAL
from postgres, so in theory you shouldn't lose any data if the master crashes.
Note that this does require a replication slot on the master to implement.

<rant> It's also stupid that you still have to set up WAL shipping (e.g. via
rsync or scp) before taking a base backup even if you have streaming
replication enabled. </rant>

That being said though, I have not been happy with the restore performance of
barman, though admittedly this may be I/O related.

------
polygot
"So in other words, out of 5 backup/replication techniques deployed none are
working reliably or set up in the first place."

~~~
echelon
Does this mean whatever was in that database is gone, with no available
backups?

Is this an SOA where important data might lie in another service or data
store, or is this a monolithic app and DB that is responsible for many (or
all) things?

What was stored in that database? Does this affect user data? Code?

~~~
YorickPeterse
We have snapshots, but they're not very recent (see the document for more
info). The most recent snapshot is roughly 6 hours old (relative to the data
loss). The data loss only affects database data, Git repositories and Wikis
still exist (though they are fairly useless without a corresponding project).

~~~
echelon
Best of luck with the recovery! I know this must be stressful. :(

------
gizmo
This is painful to read. It's easy to say that they should have tested their
backups better, and so on, but there is another lesson here, one that's far
more important and easily missed.

When doing something really critical (such as playing with the master database
late at night) ALWAYS work with a checklist. Write down WHAT you are going to
do, and if possible, talk to a coworker about it so you can vocalize the
steps. If there is no coworker, talk to your rubber ducky or stapler on your
desk. This will help you catch mistakes. Then when the entire plan looks
sensible, go through the steps one by one. Don't deviate from the plan. Don't
get distracted and start switching between terminal windows. While making the
checklist ask yourself if what you're doing is A) absolutely necessary and B)
risks making things worse. Even when the angry emails are piling up you can't
allow that pressure to cloud your judgment.

Every startup has moments when last-minute panic-patching of a critical part
of the server infrastructure is needed, but if you use a checklist you're not
likely to mess up badly, even when tired.

~~~
ryandrake
If you get the chance to observe pilots operating in the cockpit, I'd
recommend it. Every important procedure (even though the pilot has it
memorized) is done with a checklist. Important actions are verbally announced
and confirmed: "You have the controls" "I have the controls". Much of flight
training deals with situational awareness and eliminating distractions in the
cockpit. Crew Resource Management[1].

1:
[https://en.wikipedia.org/wiki/Crew_resource_management](https://en.wikipedia.org/wiki/Crew_resource_management)

~~~
sschueller
There is a neat video[1] where a Swiss flight has to make an emergency landing
and just happens to have a film crew in the cockpit.

[1]
[https://www.youtube.com/watch?v=rEf35NtlBLg](https://www.youtube.com/watch?v=rEf35NtlBLg)

~~~
PKop
Here's a great documentary [0] by Errol Morris about the United Flight 232
crash in 1989 [1].

"..the accident is considered a prime example of successful __crew resource
management __due to the large number of survivors and the manner in which the
flight crew handled the emergency and landed the airplane without conventional
control. "

I highly recommend it

[0] [https://www.youtube.com/watch?v=2M9TQs-
fQR0](https://www.youtube.com/watch?v=2M9TQs-fQR0)

[1]
[https://en.wikipedia.org/wiki/United_Airlines_Flight_232](https://en.wikipedia.org/wiki/United_Airlines_Flight_232)

~~~
neurotech1
Another recent example is Qantas QF32 had an engine explode (fire then
catastrophic/uncontained turbine failure) and the A380 landed with one good
engine, and two degraded engines. The entire cockpit crew of 5 pilots did a
brilliant job in landing the jet.

------
ocdtrekkie
First of all, I want to say: I think the GitLab people are incredible. They're
open and transparent about everything they do, and I think that in a world
dominated by GitHub, the fact that GitLab exists is hugely important to
decentralization and the continued march of open source. This document
continues to demonstrate how the GitLab people are great, transparent people.

That being said, this is why you shouldn't entrust a cloud service to keep
your data safe: "So in other words, out of 5 backup/replication techniques
deployed none are working reliably or set up in the first place."

My backups work. I know they work, because I run them and I test them. People
entrusting cloud services to have good backups cannot say that.

------
mnarayan01
Start at:

> At this point frustration begins to kick in. Earlier this night YP
> explicitly mentioned he was going to sign off as it was getting late (23:00
> or so local time), but didn’t due to the replication problems popping up all
> of a sudden.

This is why I'm not a fan of emergency pager duty.

------
Heliosmaster
they are livestreaming their recovery process:
[https://www.youtube.com/watch?v=nc0hPGerSd4](https://www.youtube.com/watch?v=nc0hPGerSd4)

~~~
asmosoinio
Which is pretty amazing in my opinion!

------
lallysingh
I'm commenting a bit late, but I hope it's still read by the gitlab team.

First, you kept your heads and didn't turn on each other. That's a major
success, and gives me more confidence in gitlab. The rest you can improve on
_only_ _if_ you have this right.

Second, I'm sure you've gotten the message to test your backups and recovery
plan. It's a good time to read the Google SRE book, and consider how to put
together full integration tests that build up a db, back it up, nuke the
original, and recover from the backup. With containers this isn't actually
awful to do.
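
A bare-bones sketch of what such a test could look like with Docker and the
stock postgres image (the image tag, credentials and sleep-based waiting are
all placeholder-grade; a real test would poll pg_isready and assert on real
application data):

      #!/bin/sh -e
      # 1. build up a throwaway db with some known data
      docker run -d --name pgtest -e POSTGRES_PASSWORD=secret postgres:9.6
      sleep 10
      docker exec pgtest psql -U postgres -c \
        "CREATE TABLE t(x int); INSERT INTO t SELECT generate_series(1,1000);"
      # 2. back it up the same way production does (pg_dump here for brevity)
      docker exec pgtest pg_dump -U postgres -Fc postgres > /tmp/test-backup.dump
      # 3. nuke the original and restore from the backup alone
      docker rm -f pgtest
      docker run -d --name pgtest -e POSTGRES_PASSWORD=secret postgres:9.6
      sleep 10
      docker exec -i pgtest pg_restore -U postgres -d postgres --clean --if-exists < /tmp/test-backup.dump
      # 4. fail loudly if the data didn't come back
      docker exec pgtest psql -U postgres -tAc "SELECT count(*) FROM t;" | grep -qx 1000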

But I didn't see much mentioned about load tests. A few simple scripts that
hit your server (or test instance!) hard can help you find points where things
fall apart under load. Even if you don't have a good way to gracefully do
anything else other than alert a human, you can figure out what to monitor and
how to make sure your backup/recovery plans can deal with a shit-ton of
spammer data suddenly in your DB.

~~~
cabargas
Thank you for your feedback. I will add your suggestions to our document!

We will be implementing new policies on backups and you can totally expect us
doing load tests in the future. The whole team wants to make sure this will
not happen again.

------
ChuckMcM
_So in other words, out of 5 backup /replication techniques deployed none are
working reliably or set up in the first place._

Ouch. That is so harsh. Sorry to hear about the incident. Testing one's backups
can be a pain to do but it is so very important.

------
V-2
I know it's not a laughing matter (and suddenly my working day just seems much
easier), but when I read

 _" YP says it’s best for him not to run anything with sudo any more today"_

I couldn't but smile. Yeah, well, that's a good point probably : )

------
musicmatze
They say that git data (repos and wikis) is not affected... well... if only
they had kept their PRs and Issues in git repositories, too...

Disclaimer: Worked on a POC for exactly this last semester and going to
publish my results in the next few weeks.

~~~
procrastitron
GitLab has an open issue ([https://gitlab.com/gitlab-org/gitlab-
ce/issues/4084](https://gitlab.com/gitlab-org/gitlab-ce/issues/4084)) to use
git-appraise ([https://github.com/google/git-
appraise](https://github.com/google/git-appraise)) for storing pull requests
in the repository.

~~~
musicmatze
I just checked git-appraise and while it looks rather mature, it uses (as far
as I can see by now) an approach which is not that nice if you have multiple
public remotes. Also, as far as I can see, each submitter must be able to push
to that remote - please correct me if I'm wrong.

We have a different approach for this, which is more powerful (talking about
how the data is stored).

Of course, our tool is not mature yet. Maybe gitlab can learn from what we've
researched...

------
nickjj
I wonder if this could have been avoided by using different subdomains.

For example, instead of db1.cluster, what if it were named db-
production.cluster. Would it have still happened? Probably not.

I could totally see myself accidentally typing db1 out of muscle memory, but
there's no way to accidentally type db-production.

~~~
wolfgang42
They already do this. Both db1.cluster and db2.cluster are production
machines; their staging equivalents are db1.staging and db2.staging. The
confusion was between two production instances--one of which had the latest
data, and the other was replicating. The intent was to delete the partially
replicated data but the command was run on the current master database server
instead.

------
coding123
We're using GitLab CE for our internal use, but one of our customers is
on Gitlab.com. We're likely still going to recommend Gitlab.com after this,
but we'll also enable mirroring to our internal instance.

As a side note, I just checked our S3 Gitlab backup bucket and it does have
backups for every day for the last year (1.8 GB each, yikes), so instead of
failing to create the backups, it's actually failing to delete the older ones!
:)

~~~
bm5k
Gitlab's automagic backup cleaner specifically does not work for S3 backups.
To automatically clean up your s3 bucket use S3's built in lifecycle settings.
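
If you're using the AWS CLI, something along these lines should do it; the
bucket name, prefix and retention period are placeholders, so check the S3
lifecycle docs before relying on it:

      # expire objects under db_backups/ after 30 days
      aws s3api put-bucket-lifecycle-configuration --bucket my-gitlab-backups \
        --lifecycle-configuration '{"Rules": [{"ID": "expire-old-gitlab-backups",
          "Filter": {"Prefix": "db_backups/"}, "Status": "Enabled",
          "Expiration": {"Days": 30}}]}'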

------
cs2818
Wow, very intriguing to read. I appreciate honest event descriptions like
this.

This has reminded me how important it is to perform regular rehearsals of data
recovery scenarios. I'd much rather find these failures in a practice run.
Thanks to GitLab for continuing to openly share their experience.

------
TimWolla
Are 'YP' the initials of an employee or is this an acronym I don't know?

~~~
Perihelion
Yes, those are the initials of an employee here. Sorry for the confusion!

~~~
detaro
As much as I appreciate GitLabs extreme openness, that's maybe something that
by policy shouldn't be part of published reports. Internal process is one
thing, if something goes really bad customers might not be so good at
"blameless postmortems" if they have a name to blame.

~~~
sytse
That is why we went with initials. And I hope customers understand the blame
is with all of us, starting with me. Not with the person in the arena.
[https://twitter.com/sytses/status/826598260831842308](https://twitter.com/sytses/status/826598260831842308)

~~~
grhmc
It seems to me that, as a customer, it is blame-shifting away from the company
to a particular person. Blameless post-mortems are great, but when speaking to
people outside the company I think it is important to own it collectively,
"after a second or two _we_ notice _we_ ran it on db1.cluster.gitlab.com,
instead of db2.cluster.gitlab.com." I believe this isn't your intention, but
that is how I interpreted it.

~~~
greenleafjacob
In our postmortems we explicitly avoid referring to names and only refer to
"engineers" or specific teams. There is no reason to refer to specific names
if your intention is a systems/process fix.

~~~
antocv
To me those "Engineers" read as faceless replaceable cogs. This initials make
it personal, its better, we can now say "YP" thats exactly you, hey, chin up.
Sounds better than "engineering team 42".

You write CEOs name on all your publications, of course always taking
credit/glory, but why not let engineers do the same, take credit/ownership
when doing a nice commits, and when fucking up. We're all people first, and
prefer to speak/talk to people and not Engineering Team MailBox at Enterprise
Corporation.

------
LeonidBugaev
Not so long ago GitLab decided to move from AWS to managing their own
hardware. I wonder if this situation could have happened if they had used
managed Postgres with automatic backups. Most of us use the cloud because ops
is hard, and human-related risks are too high.

~~~
lbotos
That choice to go to bare metal was reversed:

[https://gitlab.com/gitlab-
com/infrastructure/issues/727#note...](https://gitlab.com/gitlab-
com/infrastructure/issues/727#note_20044060)

[https://webcache.googleusercontent.com/search?q=cache:M2CRY7...](https://webcache.googleusercontent.com/search?q=cache:M2CRY7xD6v4J:https://gitlab.com/gitlab-
com/infrastructure/issues/727+&cd=1&hl=en&ct=clnk&gl=us) (cache link until
GL.com is back up.)

~~~
motles
Does Azure not have managed postgres?

~~~
rrdelaney
Azure does not provide a managed postgres service. However it does have Azure
SQL[0], which is based on SQL Server.

Disclaimer: I work for Microsoft

[0] [https://azure.microsoft.com/en-us/services/sql-
database/](https://azure.microsoft.com/en-us/services/sql-database/)

------
ArtDev
I noticed the issue when I was pushing code earlier today. Hopefully this gets
resolved soon. You guys are doing a great job. Keep up the good work!

~~~
sytse
Thanks, not feeling great about the job we're doing today, but we'll learn
from this.

~~~
antocv
The way you are open about this and do not blame the engineer who did "rm
-Rvf" (he knows he fucked up and suffers enough already), and the fact that
you see improvements can be made and are willing to make them, says a lot.

Applying for work with you now, and moving all my stuff to GitLab.

~~~
mahnunchik
Move files to GitLab: git status & rm -Rvf .

------
jtchang
You know this actually makes me want to try out gitlab. They are super
transparent and the fact that they ran into this issue now means they will
just be better off in the long run. Does it suck? Sure. But this is why you
backup your own data anyway.

------
danaliv
_> 2017/01/31 23:00-ish YP thinks that perhaps pg_basebackup is being super
pedantic about there being an empty data directory, decides to remove the
directory. After a second or two he notices he ran it on
db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com_

A long, frustrating day. Running destructive commands at 11pm. This is why
pilots have duty time limits. YP should've been relieved by another engineer
who was physically and mentally fresher. Human beings have limits, and when we
reach them, we make more—and worse—mistakes. Any process that fails to account
for this is broken.

------
graphememes
If you haven't tested your backups, you don't have backups.

------
jfindley
"The replication procedure is super fragile, prone to error, relies on a
handful of random shell scripts, and is badly documented"

This is true of many databases, but in my experience is _particularly_ true of
postgres. It's a marvelous single-instance product, but I've never really
found any replication/ha tech for it that I've been that happy with. I've
always been a bit nervous about postgres-backed products for this very reason.

I'd be interested in other people's take on this, though.

~~~
dijit
I don't share your observation at all.

If we're talking normal replication then I can tell you for certain that the
built in replication system is completely rock solid (especially compared to
MySQL).

If you're talking multi-master replication, then Citus is pretty solid, but
not nearly _as_ solid as replication.

If you're talking statement based replication, well, it's the same as all
databases. Here be dragons.

~~~
jfindley
The built-in replication may be good - but it's pretty new. Replication slots,
which are IMO vital for making the built-in replication non-fragile, only
arrived in 9.4, which is a pretty recent release. I wonder how widely tested it
is, given that? How many people are actually using it for large workloads?

Citus I hadn't heard of though - that's interesting, thanks.

~~~
dmichulke
> only arrived in 9.4, which is pretty recent of a release.

Release Date: 2014-12-18 (from
[https://www.postgresql.org/docs/9.4/static/release-9-4.html](https://www.postgresql.org/docs/9.4/static/release-9-4.html))

Two years surely do not confer the status of "battle-tested" but I wouldn't
call it "recent" either. Then again, DB service levels and standard
application service levels might differ by a few nines/sigmas here.

Is that what you're referring to?

~~~
jfindley
Okay, it's less recent than I remembered (how time flies!).

Still, users are typically fairly slow to update their DB software, so I
suspect 9.4 is still a pretty small percentage of the installed base at this
point.

------
tony-allan
It's good when companies are open and honest about problems.

I imagine they will have a great multi-level tested backup process in the next
day or two!

~~~
connorshea
It'll definitely be a priority now!

------
yjlim5
My team and I switched from bitbucket to gitlab a few months ago and we love
the transition. Gitlab provides a lot of value to me and my team who are
learning how to code while working on side projects. Although we cannot send
merge requests today because of this issue, we are all cheering them on. I’m
very happy that they are so transparent about their issues because my team and
I are learning so much from their report and insights here on HN comments.
Good luck!

~~~
eblanshey
Would you mind explaining what you like about gitlab better than bitbucket?
They seem to be on par with each other, including integrated CI.

~~~
songzme
I like gitlab's issue tracking. It's faster (from a productivity standpoint)
and easier to manage compared to Jira + Bitbucket, which feel a bit too
bloated.

------
perlgeek
>2\. Regular backups seem to also only be taken once per 24 hours, though YP
has not yet been able to figure out where they are stored. According to JN
these don’t appear to be working, producing files only a few bytes in size.

>3\. SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries
are being run instead of 9.6 binaries. This happens because omnibus only uses
Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not
exist. As a result it defaults to 9.2, failing silently. No SQL dumps were
made as a result. Fog gem may have cleaned out older backups.

>5\. Our backups to S3 apparently don’t work either: the bucket is empty

I think we've all seen this happen with some kind of regularly generated
report or backup: an empty file, or none at all, is produced due to some
silent error.

I highly recommend creating a monitoring check for file size for each
automatically generated file.
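
A crude version of such a check, with invented paths and thresholds; wire the
alert into whatever monitoring or mail setup you already have:

      #!/bin/sh
      # run from cron shortly after the backup window closes
      latest=$(ls -t /backups/db/*.tar.gz 2>/dev/null | head -n1)
      min_bytes=$((500 * 1024 * 1024))   # anything under ~500MB is suspicious for this db
      if [ -z "$latest" ] || [ "$(stat -c %s "$latest")" -lt "$min_bytes" ]; then
          echo "backup missing or suspiciously small: ${latest:-none}" \
            | mail -s "BACKUP CHECK FAILED" ops@example.com
      fi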

At $work, we also generate quite a few config files (for Radius, DHCP servers,
web servers, mail servers....). For those we have a mechanism that diffs the
old and the new version, and rejects the new version if the diff exceeds a
pre-defined percentage of the file size, escalating the decision to a human.
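
Sketched very roughly, with a made-up threshold and file names:

      #!/bin/sh
      old=/etc/dhcp/dhcpd.conf
      new=/tmp/dhcpd.conf.generated
      total=$(wc -l < "$old")
      changed=$(diff "$old" "$new" | grep -c '^[<>]')
      # refuse to install if more than 10% of the file changed; a human decides instead
      if [ "$changed" -gt $((total / 10)) ]; then
          echo "$new changes $changed of $total lines; refusing to install" >&2
          exit 1
      fi
      cp "$new" "$old"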

------
onetom
Thank you GitLab people for writing a log! It's just as important as the
service you are providing. Keep up the good job and never be afraid to talk
about your mistakes.

It would also be very educational if you could try to do a "5 whys" session
and share it too. The person who made the mistake deserves a bit of rest,
that's for sure. I hope she or he is supported by the team emotionally and not
just blamed.

~~~
sytse
Totally agree with doing the 5 why's, see
[https://news.ycombinator.com/item?id=13539595](https://news.ycombinator.com/item?id=13539595)

------
zamalek
> Work is interrupted due to this, and due to spam/high load on GitLab.com

This is why spam should be illegal. The advertisers, the ISPs harboring them
or their country should be taken on for damages. This not only prevented
Gitlab from doing business, but also people who depend on them from doing
business. It's criminal.

------
tcpipcowboy
I'm curious to know what strategy has been developed out of this regarding
delivery of spam through creation of snippets. In the original GitLab First
Incident report it noted "spammers were hammering the database by creating
snippets, making it unstable". So many easily accessible platforms are out
there that this method of spamming could be used on that it seems like a
necessity to evaluate current workflows and identify where checks/balances can
be inserted that would prevent this from happening again. Short of removal of
snippets, there must be some method of snippet grepping that would put a pause
on suspicious snippets, preventing the bulk of submissions along the lines of
what GitLab initially received.

------
piinbinary
If you haven't done so recently, TEST YOUR BACKUPS.

------
riebschlager
I'm another happy GitLab user, but things like this always kinda freak me out.

Do any of you use any repo-mirroring strategy? Something a little more
automated than pushing to and maintaining separate remotes? For example, would
it be worth it to spin up a self-hosted GitLab instance and then script
nightly pulls from GitLab.com?

Edit: Answered my own question! If anyone else was curious:
[http://stackoverflow.com/questions/14288288/gitlab-
repositor...](http://stackoverflow.com/questions/14288288/gitlab-repository-
mirroring)

~~~
louiz
Why would you use gitlab.com if you have a self-hosted Gitlab instance?

~~~
riebschlager
Because I trust(ed) GitLab's backup procedures more than my own.

------
jorblumesea
So, maybe a stupid question from a frontend dev who doesn't deal with these
systems at all, but aren't these systems usually part of a cluster with read
replicas? Blowing away the contents of one box shouldn't destroy the cluster
right? I thought the primary/secondary pattern was really common among
relational databases and failover boxes and other measures were standard
practice. Was the command executed on all machines? Is the cluster treated as
one file system? Please excuse the ignorance.

~~~
terom
Replication systems can break. They were working on fixing the broken db2 read
replica when they accidentally nuked the primary db1 server.

> db2.cluster refuses to replicate, /var/opt/gitlab/postgresql/data is wiped
> to ensure a clean replication

> [...] decides to remove the directory. After a second or two he notices he
> ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com

That ops fail on db1 wouldn't have been such an issue if they weren't in such
a vulnerable position with an invalidated read replica.

------
leesalminen
This is the stuff my nightmares consist of after 900 consecutive days of being
on call (and counting).

~~~
slezakattack
Are you a one man team or...? My wife would probably leave me if I was on-call
for that long.

~~~
leesalminen
I am :/ ... currently maintaining service for 500 high-volume businesses
24x7x365 in 9 timezones. Luckily the product and infrastructure is pretty
stable and problems occur maybe once a quarter.

But the constant nagging in the back of your head that shit can go wrong at
any second is draining and has been the biggest stressor in my life for a long
time now.

My S.O. still gets mildly upset when I pack up the laptop on our way out to a
fancy dinner, or disappear with my laptop when visiting her parents, but the
fact that our life goals are aligned is the saving grace of all these
situations. We both know what we want out of the next 5 years of our lives and
are willing to sacrifice to achieve this goal (long term financial security).

~~~
overcast
I hope you are being SERIOUSLY compensated.

~~~
leesalminen
Cash salary today is well under market for my skill set. But, I do own 1/3 of
the company so it's not all bad :).

~~~
overcast
With all of that clout, why aren't you hiring?

------
CaliforniaKarl
First off, my heartfelt commiserations for the GitLab team here. My
suggestion: Start watching an hour-long video; the rsync will finish right
when it gets to the good part!

I wonder if a future project might be to have the DB-stored stuff use Git as a
replication back-end. Like, for example, having each issue be a directory, and
individual comments be JSON files. It would never (normally) be the data store
"of record" (the DB would), but maybe that would work as a backup?

~~~
jamesmiller5
Gerrit is working on just this. It uses the `refs/meta/config` branch for
project configuration and is moving its database dependencies into a git
database. Reviews are stored in refs/changes/* . Backing up a project and
verifying its integrity is as simple as `git clone --mirror`.

------
ckdarby
"Our backups to S3 apparently don’t work either: the bucket is empty"

6/6 failed backup procedures. Looks like they are going to be hiring a new
sysadmin/devops person...

~~~
Washuu
The best system administrator is the one that has learned from their
catastrophic fuck up.

To that effect, I still have the same job as I did before I ran "yum update"
without knowing it attempts to do in-place kernel upgrades, which resulted in
a corrupted RedHat installation on a server we could not turn off.

~~~
overcast
There is learning from a catastrophic fuck up, and then there is incompetence.
Backups are Day 1, SysAdmin 101. I can't quite grasp how so many different
backup systems were left unchecked. Every morning I receive messages saying
everything is fine, yet I still go into the backup systems to make sure they
actually did run, in case there was an issue with the alerting itself.

~~~
wtbob
> There is learning from a catastrophic fuck up, and then there is
> incompetence.

We all start at incompetence, but eventually we — wait for it — learn from our
experiences. Would you believe that Caesar, Michael Jordan and Steve Wozniak
once were so incompetent that they couldn't even control their bowels or tie
their shoes? They learned.

Is it possible that the guys in the team running GitLab's operations were
misplaced? Certainly — that's a _management_ issue. And I can guar-an-tee you
that GitLab now has a team of ops guys who viscerally understand the need for
good backups: they'd be insane to disperse that team to the winds.

~~~
overcast
There's no excuse for backups not being set up, period, especially for such a
high-profile site with the rigorous hiring circus they put candidates through.
This doesn't fall under "a learning experience". I wish them luck, but this is
just gross negligence.

------
ygersie
Funny that I haven't seen the old adage yet: two is one, and one is none.
AFAICT there were only 2 Postgres instances? What gives? How would you ever
feel comfortable when one goes down?

How we deal with recovery:

\- run DB servers on ZFS

\- built a tool to orchestrate snapshotting (every 15 minutes), using an
external mutex to distribute snapshot creation for the best recovery accuracy

You could also have increased retention over time, like:

\- keep 6 snapshots of 5 minutes

\- 4 hourly

\- 1 daily

\- 1 weekly

Recovery: choose the point in time closest to the fckup; the tool
automatically elects the DB with the closest snapshot earlier than that time.
All other slaves are restored to before that point in time and rolled forward
to the active state of the new "master".

Instead of executing worst-case recovery plans by copying data to at least 6
(minimum) db read slaves, we can recover in minutes with minimal data loss
(especially when you consider downtime == data loss).

There are cases where a setup like this would be a no-go (think of companies
where lost transactions are absolutely devastating), but I don't think Gitlab
is one of those.

A side effect of ZFS is being able to ship blocks of data as offsite backups
(instead of dumping), being able to `zpool import` anywhere, checksumming,
compression, etc.
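
A minimal sketch of what such a snapshot cron job might look like (the dataset
name, schedule and retention count are assumptions for illustration, not a
description of the actual tool mentioned above):

    
    
      #!/bin/bash
      # Hypothetical 15-minute cron job: snapshot the Postgres dataset and prune old snapshots.
      set -euo pipefail
      DATASET="tank/pgdata"   # assumed ZFS dataset holding the Postgres data directory
      KEEP=24                 # keep the last 24 snapshots (~6 hours at 15-minute intervals)
      # Create a timestamped snapshot.
      zfs snapshot "${DATASET}@auto-$(date +%Y%m%d-%H%M)"
      # Destroy everything older than the newest $KEEP "auto-" snapshots.
      zfs list -H -t snapshot -o name -s creation \
        | grep "^${DATASET}@auto-" \
        | head -n -"${KEEP}" \
        | xargs -r -n1 zfs destroy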

------
jlgaddis
Apparently the BOFH works for Gitlab these days:

> _It's backup day today so I'm pissed off. Being the BOFH, however, does
> have it's advantages. I reassign null to be the tape device - it's so much
> more economical on my time as I don't have to keep getting up to change
> tapes every 5 minutes. And it speeds up backups too, so it can't be all bad
> can it? Of course not._ \--bofh, episode #1

------
aidos
I really feel for everyone involved. Knowing Gitlab, they'll learn and become
better for it.

I've been using the PS1 trick they mention for the last couple of years and
I've found it to be a really good visual check (red=prod, yellow=staging,
green=dev). We then also apply the colorscheme to the header in our admin
pages too. Those of us that are jumping between environments are a big risk to
data :-)
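
For anyone who wants to copy the habit, a minimal sketch of the idea in bash
(the hostname patterns are made up; adapt them to your own naming):

    
    
      # In ~/.bashrc: colour the prompt by environment so prod terminals scream at you.
      case "$(hostname)" in
        *prod*)    PS1='\[\e[41;97m\][PROD]\[\e[0m\] \u@\h:\w\$ ' ;;    # red background
        *staging*) PS1='\[\e[43;30m\][STAGING]\[\e[0m\] \u@\h:\w\$ ' ;; # yellow background
        *)         PS1='\[\e[42;30m\][DEV]\[\e[0m\] \u@\h:\w\$ ' ;;     # green background
      esac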

------
technion

        a user for using a repository as some form of CDN, resulting in 47 000 IPs signing in using the same account 
    

I'd be interested in how this occurs. Simply linking a raw file in a
repository would surely not require a sign in. Did someone come up with some
way of automatically using credentials on a download link?

47 000 simultaneous users suggests it wasn't a small project that did so.

~~~
justinclift
As a complete guess, something like using sessions persisted back to the
PostgreSQL database, _without_ something like memcached in front of it.

With that kind of approach it could be trying to update a session table (with
new IP address?) for literally every page load by the 47,000 people. Which
would probably suck. ;)

------
sashk
Backups have sucked since 8.15 on our instances of GLE, because someone
decided to add a "readable" date stamp in addition to the unix timestamp in
the backup file name without proper testing, which caused many issues. It was
somewhat fixed, but I still see issues in 8.16.

I'm not complaining, but backup/restore is an important part and deserves 100%
test coverage and daily backup/restore runs.

------
anonfunction
You don't have a production ready backup system until you've used it.

~~~
fsiefken
Yes, when I was responsible for databases and servers 'red alert' file where
all the worst case scenario's were described with the recovery procedures
which were tested every half year or so. This came about after these
scenario's happened one after the other and I had to fix them manually. One of
them, a hard disk crash, were a tape backup failed for some reason, and the
tape from two days ago was to outdated. I did a filesystem recovery, then a
database recovery and consistency check, mounted the files and looked at which
tables were corrupted and restored those from backup. I didn't want to through
this (with a huge time pressure) ever again. After that we decided to check if
all backups succeeded and were consistent so everyone on the team could
restore within an hour.

------
dorianm
> 2017/01/31 23:00-ish

> YP thinks that perhaps pg_basebackup is being super pedantic about there
> being an empty data directory, decides to remove the directory.

> After a second or two he notices he ran it on db1.cluster.gitlab.com,
> instead of db2.cluster.gitlab.com

> 2017/01/31 23:27

> YP terminates the removal, but it’s too late. Of around 310 GB only about
> 4.5 GB is left

The naming couldn't be more confusing

~~~
richardwhiuk
In what way? They have two production database servers, db1 and db2.

------
scurvy
At my $currentJob, we run 30TB worth of database restores every day in
automated processes.

We do this because the DBA insisted that the DB backup process was fine. We
tried to restore 3 backups as a test, and they all failed. We no longer have
DBA's. We have automated procedures and very thorough testing. Zero failed
restorations since then.
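
A minimal sketch of that kind of automated restore check for Postgres (the
dump path, scratch database name and the `users` table are assumptions for
illustration, not the poster's actual setup):

    
    
      #!/bin/bash
      # Hypothetical nightly job: restore yesterday's dump into a scratch DB and sanity-check it.
      set -euo pipefail
      DUMP="/backups/app-$(date -d yesterday +%F).dump"   # assumed custom-format pg_dump output
      SCRATCH=restore_test
      dropdb --if-exists "$SCRATCH"
      createdb "$SCRATCH"
      pg_restore --no-owner -d "$SCRATCH" "$DUMP"
      # Fail loudly if the restored data looks empty.
      psql -d "$SCRATCH" -tAc "SELECT count(*) FROM users;" | grep -qv '^0$'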

------
zenlikethat
Hang in there GitLab, it sucks now but mistakes like this can happen to
anyone. The important thing is how you deal with it.

------
adamkittelson
I think the fact that no one seems to even remember that GitHub had a similar
incident in 2010 [https://github.com/blog/744-today-s-
outage](https://github.com/blog/744-today-s-outage) is a good thing to keep in
mind for the GitLab team. This too shall pass etc.

------
alyandon
Thank you for the transparency. This is a good read and I'm going to be
sharing it with coworkers tomorrow. :)

~~~
connorshea
Glad to hear others will learn from our mistakes, we certainly are :)

------
mdekkers
_YP thinks that perhaps pg_basebackup is being super pedantic about there
being an empty data directory, decides to remove the directory. After a second
or two he notices he ran it on db1.cluster.gitlab.com, instead of
db2.cluster.gitlab.com_

My heart honestly goes out to YP. This is a terrible sysops _oh shit_ moment.

------
dx034
The recovery process seems very slow with ~50mbit/sec. Could that be an issue
related to cloud providers? I heard that issue quite often when dealing with
AWS/Azure. Even HDDs should have much higher throughput for that kind of
transfer.

If they had dedicated hardware in 2 datacentres on the same continent, copying
between those servers should easily be possible at 250mbit/s or more (from my
experience). Especially as they seem to copy at the US east coast, where it's
now night.
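
For scale, a quick back-of-the-envelope using the figures from the report
(310 GB at ~50 Mbit/s, treating a GB as 10^9 bytes):

    
    
      # 310 GB * 8 bits/byte = 2,480,000 Mbit; at 50 Mbit/s that's ~49,600 s, i.e. roughly 14 hours.
      echo "scale=1; 310 * 8 * 1000 / 50 / 3600" | bc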

For me, that would be a serious issue dealing with cloud providers. If I have
a server with a 250mbit connection, I expect to be able to copy data between
datacentres at that speed. And I never had problems with OVH, Hetzner and the
like.

~~~
rsynnott
This wouldn't be a problem with modern AWS instances. Possibly it's an Azure
thing?

------
dboreham
Well...never delete. Rename the directory someplace else so you can get it
back if the deletion was a bad plan.

Also helpful to make the window background color different, or some other
highly conspicuous visual difference when working on several very similar
production machines.
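
A minimal sketch of the "move, don't delete" habit (the quarantine name is
just an example):

    
    
      # Instead of: rm -rf /var/opt/gitlab/postgresql/data
      # park the directory under a timestamped name and only delete it days later, once you're sure.
      mv /var/opt/gitlab/postgresql/data \
         "/var/opt/gitlab/postgresql/data.removed.$(date +%Y%m%d-%H%M%S)"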

------
INTPenis
[http://checkyourbackups.work/](http://checkyourbackups.work/)

It hits home.

I've seen three backup methods fail when it came time for an emergency
restore, due to lack of competence, confusion and lack of regular restore
tests.

------
luckystartup
I have at least 10 private repos on GitLab, and many public ones. Even so,
this is no big deal to me. That's the beauty of git. Even if all of their
backups fail, I can just do a push and everything is back up there.

I just hope my laptop doesn't die before they get it back online.

EDIT: Was fun to put this little command together. Run this from your code
directory, and it will push all of your gitlab repos. I'm going to run it when
GitLab is back online.

    
    
        find . -maxdepth 3 -type d -name '.git' -exec bash -c 'cd "${1%/.git}" && git remote -v | grep -q gitlab.com && echo "Pushing $PWD..." && git push' -- {} \;

~~~
wolfgang42
From the incident report:

> Git repositories are NOT lost, we can recreate all of the projects whose
> user/group existed before the data loss, but we cannot restore any of these
> projects issues, etc.

Your fancy snippet will report that it has pushed no changes. The data that
was lost was new issues, PRs, issue comments, and so on; I've never heard of
anyone keeping backups of these on their local laptops.

~~~
luckystartup
> I've never heard of anyone keeping backups of these on their local laptops.

Hmm... That's an interesting idea!

You could do that on a separate (empty) branch. Maybe call it `__project`, and
you could just have folders of markdown files. You could have two root folders
for `issues/` and `pull_requests/`, and two subfolders in each for `./open/`
and `./closed/`. And a simple command-line tool + web UI. You could just edit
the file to add a comment.

It would be really nice to have a history and backup of all of your issues. I
also like the fact that you could create or edit issues offline.

Then you could also set up a 2-way sync between your repo and GitLab / GitHub
/ Trello.
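
A minimal sketch of bootstrapping such a branch (the branch and folder names
follow the suggestion above; none of this is an existing GitLab/GitHub
feature):

    
    
      # Create an otherwise-empty branch to hold issue/PR mirrors as plain markdown files.
      git checkout --orphan __project
      git rm -rf --quiet .     # clears the working copy; the files stay safe on the original branch
      mkdir -p issues/open issues/closed pull_requests/open pull_requests/closed
      echo "# 42: Backups are empty" > issues/open/42.md
      git add .
      git commit -m "Mirror issue #42"
      git push origin __project   # assumes a remote named origin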

~~~
zokier
That sort of "inline" issue tracking is a thing. I think Bugs Everywhere[1] is
one of the more mature systems based on the idea. There are several others
too[2], most of them unmaintained. There are also wiki-style systems based on
the same idea.

[1] [http://www.bugseverywhere.org/](http://www.bugseverywhere.org/)

[2] [http://www.cs.unb.ca/~bremner/blog/posts/git-issue-trackers/](http://www.cs.unb.ca/~bremner/blog/posts/git-issue-trackers/)

------
derricgilling
I like to always try to automate stuff as much as possible to remove human
error. It's easy to forget an arg or other parameter on a script, or to not
even know what the arg was for in the first place. Sounds like their last
backup was 24 hrs ago. Having backups is like having good security: you don't
realize how important it is until it's too late. Reminds me of this old meme:

[http://www2.rafaelguimaraes.net/wp-content/uploads/2015/12/giphy2.gif](http://www2.rafaelguimaraes.net/wp-content/uploads/2015/12/giphy2.gif)

------
samblr
Have been a happy gitlab user. Reading this worries me, as honestly I do not
know what I have personal backups of.

But I appreciate the transparency with which GitLab is dealing with the issue.
And I hope GitLab will bounce back for good.

------
umbrai_nation
"Of around 310 GB only about 4.5 GB is left"

What is gitlab storing in their database? From what I understand, the repos
were untouched by the DB problems, so what is taking up a third of a terabyte
of DB space?

~~~
YorickPeterse
issues, merge requests (titles, descriptions, etc), comments, events ("X
pushed to Y"), labels, projects, milestones, users, permissions, project
statistics, CI builds information, abuse reports, the list goes on.

------
beezischillin
As a small team made up of 3 companies that work closely together, we all use
GitLab's services daily for our work. Thank you for the great service and we
wish you a speedy and painless recovery!

------
justinclift
"Removed a user for using a repository as some form of CDN, resulting in 47
000 IPs signing in using the same account (causing high DB load)."

No memcached in front of PostgreSQL?

------
huula
As a user of GitLab who commits several times everyday, I have all the respect
for the great work you guys have done. Keep hacking and don't worry!

------
oelmekki
I had to restore a rails app from a day old backup once. I actually manage to
bring last day data as well by parsing POST/PUT/PATCH lines in rails log. This
is painful, and you have to keep track of new ids for relations, but it
"works" (obviously, there is info you can't retrieve that way, but in those
situations, anything more than nothing is good).

------
btgeekboy
> Our backups to S3 apparently don’t work either: the bucket is empty

followed by

> So in other words, out of 5 backup/replication techniques deployed none are
> working reliably or set up in the first place.

is no way to be running a public service with customer data. Did the person
who set up that S3 job simply write a script or something and just go "yep,
it's done" and walk away? Seriously?

~~~
overcast
Apparently the following insane interviewing process wasn't enough to find
someone competent enough to cover the basics.

[https://about.gitlab.com/jobs/production-engineer/](https://about.gitlab.com/jobs/production-engineer/)

\-------------------

Applicants for this position can expect the hiring process to follow the order
below. Please keep in mind that applicants can be declined from the position
at any stage of the process. To learn more about someone who may be conducting
the interview, find her/his job title on our team page.

Qualified applicants receive a short questionnaire and coding exercise from
our Global Recruiters

The review process for this role can take a little longer than usual but if in
doubt, check in with the Global recruiter at any point.

Selected candidates will be invited to schedule a 45min screening call with
our Global Recruiters

Next, candidates will be invited to schedule a first 45 minute behavioral
interview with the Infrastructure Lead

Candidates will then be invited to schedule a 45 minute technical interview
with a Production Engineer

Candidates will be invited to schedule a third interview with our VP of
Engineering

Finally, candidates will have a 50 minute interview with our CEO

Successful candidates will subsequently be made an offer via email

~~~
developer2
>> candidates will be invited to schedule a first 45 minute behavioral
interview with the Infrastructure Lead

Yes, go right ahead and filter out some (disclaimer before the rant: some, not
all) of the best talent. The kind of potential employee that gets rejected due
to perceived personality problems is exactly the kind of person who would tell
management to shove a stick up their ass for demanding a 2 week deadline for a
project requiring 3 months to execute properly.

Maybe if GitLab had hired the best talent, instead of the best
"behavioral/cultural fit", at least one of their 5 backup systems would have
been functional. Many people who are perfectionists in their craft, who would
never have allowed this kind of failure to take place under their watch, come
with abrasive personalities. If you only hire those who are submissive during
the interviewing process, you will get exactly what you chose - people with no
backbone to push back against unreasonable business expectations.

Case in point: would you want to hire me based on this comment of mine? Hell
no! You're going to steer clear of me and give me an instant fail during a
"behavioral interview", because you can't look past my belligerence to
understand that there is value in having employees who obsess over the little
things like having systems that do what the fuck they're supposed to do,
rather than being able to give a conformant first impression full of social
prowess. "Whoa, he used the word 'fuck' to hammer home his point; definitely
avoid hiring this guy!"

tldr; Sometimes, people who are "talented" or "skilled" get to that point by
being obsessive freaks who sit at home in the dark all night hacking away at
stuff, with no social lives. The result can be someone who knows what they are
doing because they invest all their personal free time into the domain, but
consequently has absolutely no social skills to put on display.

shorter tldr; Businesses focus on the liability of a person without
considering the potential.

~~~
elygre
Not vetting people based on behavior is perilously close to accepting the old
adage "Say what you like about Mussolini, but at least he made the trains run
on time".

~~~
watwut
And the fact is, trains were not on time at that time. It was mostly
propaganda.

------
hardwaresofton
Working on a Gitlab project right now, just noticed the site was down. Thanks
to the team for working so hard to fix/rectify this mistake and being totally
open about it.

I appreciate the openness and utility of Gitlab (as I've said in other
threads). I'm sure it's frustrating to have this happen, but hang in there!
Services generally have 99.9% uptime anyway :)

------
1945
I applaud their honesty; most companies would choose to keep customers in the
dark while they investigate "an incident."

------
rocky1138
There should be a 15 second warm-up after any `rm -rf` command in order to
give admins time to cancel a stupid move.
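
Something like this shell wrapper would approximate that (purely illustrative;
it only covers interactive shells, and the flag matching is naive):

    
    
      # In ~/.bashrc: give yourself 15 seconds to Ctrl-C a recursive force-delete.
      rm() {
        if [[ "$*" == *-rf* || "$*" == *-fr* ]]; then
          echo "running: rm $* in 15 seconds... Ctrl-C to abort" >&2
          sleep 15
        fi
        command rm "$@"
      }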

------
perryprog
And this is why I still use GitHub. It's a shame, as GitLab was looking like a
nice alternative (for self hosting, might still try it some time), but if this
sort of thing could even possibly happen.....

I didn't read too much into this, but they really didn't have any backups of
the databases?

~~~
Slaul
A very similar thing happened at GitHub a while back:
[https://github.com/blog/744-today-s-outage](https://github.com/blog/744-today-s-outage)

------
msimpson
Mistakes will happen, sure. But given the following:

> The replication procedure is super fragile, prone to error, relies on a
> handful of random shell scripts, and is badly documented

> Our backups to S3 apparently don’t work either: the bucket is empty

It seems like a lot of their backup and restore procedures were never even
tested.

------
huula
I once rm-ed my home directory when I was writing and testing a script, but it
turned out that stuff like .m2 and .ivy2 is huge, and those were the first
things to be deleted by 'rm -rf', so they kind of gave me some buffer time to
figure out that something was wrong.

------
ausjke
That's why I do not use github/gitlab/whatever to host the part of my code
that is too critical to me. I push it to my own ssh/git server and use a local
UI to interact with it instead.

Sometimes source code is very valuable and you just cannot make any mistakes
with it.

~~~
jschulenklopper
> I do not use github/gitlab/whatever to host the part of my code that is too
> critical to me

In this specific case, GitLab mentioned that code repositories are fine. It's
the database part with issues and pull requests that they are restoring.

[https://twitter.com/gitlabstatus/status/826662763577618432](https://twitter.com/gitlabstatus/status/826662763577618432)

------
stephenr
Point #3 is a good example of why "omnibus" packages are a fucking terrible
idea.

------
jeffmcjunkin
For item 3h under recovery, consider:

    
    
      chattr +i /var/opt/gitlab/postgresql/data
    

Yes, it doesn't completely stop foot-guns, but it means you have to shoot
twice [0].

[0]:

    
    
      chattr -i /whatever
      rm /whatever

~~~
artursapek
Does that prevent postgres from modifying it?

------
z3t4
There's really no point doing intensive cognitive work for more than 8 hours
straight. After that you go by instinct and muscle memory. Surprisingly, a lot
of tasks can still be done that way, but you shouldn't do anything critical.

------
kkirsche
While this is sad to see, it is a lesson to us all, and I've shared it with
coworkers who haven't taken disaster recovery as seriously as we should on our
projects. Hopefully this will help raise it as a priority.

------
arkh
> Create issue to change terminal PS1 format/colours to make it clear whether
> you’re using production or staging (red production, yellow staging)

QubesOS anyone? It could be a good idea to have a Qube per environment
targeted.

------
thepumpkin1979
I wonder if they have some kind of protocol for modifying production
environments that was somehow overridden: one person creates the bash scripts
and a second one reviews and executes them, never a single person.

------
TeeWEE
GitLab is run completely decentralized, and they are underpaying their
developers. I think that's one of the reasons for this failure.

Decentralized work can work, but face-to-face communication is important.

------
ishitatsuyuki
I was using GitLab for the huge number of features, but moved away to GitHub
due to the awful uptime and server speed.

BTW, overhosting is always a big risk, since it takes a long period to catch
up after an incident.

------
xfactor973
I'm surprised they didn't take any hourly backups.

------
arc_of_descent
Don't prepare a backup plan. Prepare a recovery plan.

~~~
pmlnr
I agree with this. Prepare for destroyed (burnt-down level) machine, for
datacentre failure, for stolen home server, for scratched blu-ray archives -
in short, for the worst. And of course, hope for the best.

~~~
djsumdog
It's easier said than done in some companies where stakeholders are always
pushing for new stuff.

Unless those who fund what you're doing understand why disaster recovery is
vital, you're going to see this.

Ideally you want devops in such a state that you can create new lower
environments that mirror production, complete with state/backup restoration,
run automatically every week.

------
wruza
rm is a part of coreutils, right? Why not just substitute it with a less
destructive script that moves files to /mnt/oops/<origpath-date>/<filename> in
the next release? Badasses can `echo badass > /etc/rm.conf` to get the
original behavior.

Admins have had to reinvent that bicycle for decades; stop it now, please!

------
neolawliet
This is kind of tangential. Does anyone have any sort of DevOps 101 resource
(books, links, etc.) for new DevOps engineers?

------
scandox
A lot of sympathy for YP. I could have been that person (on a smaller scale)
many times. An expensive lesson in this case.

------
gagabity
Nuts, doesn't Azure have a managed Database option that can be used like AWS?

------
thosemagicstars
We all fuck up, time to recovery and lessons learned is all that matters.

------
pits
Je Suis YP

------
jincheker
Incident after incident; it feels like a toy built by new grads.

------
bkbridge
This stuff is complicated. No one gets that, no one.

------
therealmarv
I only hope you prepare for DDoS attacks of this kind. GitHub has been
fighting with this stuff for the last several years... maybe you've just been
lucky so far. Please prepare!

------
soheil
bitbucket offers free private repos.

~~~
knocte
gitlab too, plus they are opensource

------
grovegames
Transparency should be like a sheer dress, enough to see, but still covering
the bits that need to stay hidden.

This is just nudity.

------
Traubenfuchs
So tl;dr: Gitlab is experiencing heavy DoS attacks that created so much data
that replication stopped working. In the process of getting replication to
work again, "YP" wanted to empty the data directory of the slave DB server,
but accidentally deleted it on the master DB server. Out of the 5
backup/replication techniques they use, not one is working reliably. By
chance, YP had manually created the backup they could use 6 hours earlier.

Tell me if I misunderstood something. I hope the customer I met last week does
not remember I ever recommended GitLab to him.

------
frik
The major Azure problems like two years ago were documented by GitLab in a
similar manner. I find the openness a good thing, even in not-so-good times.
Thumbs up.

~~~
dimitrie
Thanks for the thumbsup!

------
wildchild
Transparency may be OK, but the live stream is theatre indeed.

------
jjawssd
Their network file transfer is super slow

------
foo101
Can someone elaborate who or what is YP?

~~~
grzm
He's identified elsewhere in the thread as Yorick Peterse:

[https://news.ycombinator.com/item?id=13537132](https://news.ycombinator.com/item?id=13537132)

------
SeriousM
"somehow disallow rm - rf"... Well, you didn't understood the linux

