
Data Loss at GitLab - umairshahid
http://blog.2ndquadrant.com/dataloss-at-gitlab/
======
aamederen
With this incident, they once again showed that they are dedicated to
transparency, even in the worst of days. This increased their popularity with
me, and I believe among other developers as well. However, this may not be
the case with the business people. I hope they can survive that, and also
publish a guide for getting better at the "ops" side of things.

~~~
praneshp
HN can be funny sometimes. GitHub got a lot of hate about a year ago just for
not releasing new features. GitLab cost everyone a day because their
backup/ops practices were silly, and everyone loves them more.

I've screwed up before, and I sympathize/empathize with their ops folks, but
this should make us think about plan B in case something like this happens
again.

~~~
deckar01
> GitLab cost everyone a day

GitLab isn't popular because of the stability of its cloud platform. It's
popular because you can install your own instance for free practically
anywhere with minimal effort.

I run GitLab CE on a box in my server closet for projects that involve
livelihoods.

~~~
thewhitetulip
>It's popular because you can install your own instance for free practically
anywhere with minimal effort.

If that is the reason you use GitLab, then why not try Gitea or Gogs? Gogs is
written in Go and provides a Docker image or a drop-in binary.

~~~
problems
GitLab is a lot more feature-packed than Gitea/Gogs. Gogs is lightweight,
good for personal projects, but if you're looking for something to deploy
company-wide with everything integrated, GitLab is the way to go.

~~~
sdesol
I think this is changing at a pretty fast pace ... well, for Gitea anyway. It
also looks like Gitea is lighting a fire under Gogs, as they appear to be
iterating at a faster pace as well. Here's a very quick breakdown of what's
going on.

Activity for the last 160 days. There were 175 commits to gogs and 720 commits
to gitea.

[https://gitsense.com/gogs-gitea/commits-160days.png](https://gitsense.com/gogs-gitea/commits-160days.png)

Activity for the last 60 days. There were 109 commits to gogs and 262 to
gitea.

[https://gitsense.com/gogs-gitea/commits-60days.png](https://gitsense.com/gogs-gitea/commits-60days.png)

[https://gitsense.com/gogs-gitea/changes-60days.png](https://gitsense.com/gogs-gitea/changes-60days.png)

[https://gitsense.com/gogs-gitea/changes-files-60days.png](https://gitsense.com/gogs-gitea/changes-files-60days.png)

The options and vendor directories are unique to Gitea, and they account for
a lot of the changes within the last 60 days. I was told the vendor directory
is used to store dependencies, but I don't know what the options directory is
used for. And as the following shows, they account for a lot of the files
touched in the last 60 days.

[https://gitsense.com/gogs-gitea/changes-options-vendor-60days.png](https://gitsense.com/gogs-gitea/changes-options-vendor-60days.png)

Based on what I've read on Hacker News, the developer behind Gogs tends to
merge in changes in spurts, so it's hard to tell if this recent flurry of
activity is a spurt or not. In the following 365 days of activity, you can
see the 3 spurts for Gogs so far.

[https://gitsense.com/gogs-gitea/commits-365days.png](https://gitsense.com/gogs-gitea/commits-365days.png)

Regardless of whether or not Gogs will continue to develop at an increased
rate, it looks like Gitea will.

------
synicalx
Honestly, I'm completely flabbergasted by this. Five backups, and NONE worked
properly? Who made this? The S3 bucket was EMPTY? Has no one ever tested any
of these backups?

It's not just the impact, which is fairly sizeable in its own right; it's the
HUGE oversight on their part, and the fact that they tried to pin part of
this on PostgreSQL.

Credit where it's due; their report/transparency were good if a little
unprofessional, and something I'd like to see more of from other companies.

Putting on my BOFH hat, this is what happens when you let Devs do operational
stuff.

~~~
JohannesH
Untested backup === No backup

People like to say they have backups or a "backup procedure", but in my
experience almost none of them have ever tested the backup... not even once.
95% of the time, "having a backup procedure" just means "we have a replica of
some data sitting somewhere, with no idea how/if we can restore it, or how
long it takes".

~~~
baq
it's worse. you might be trying to restore from an untested backup and fail,
wasting precious time.

five times.

------
activatedgeek
While this is disastrous, I still think GitLab is the best thing that has
happened to OSS. This could be taken as rhetoric, but on a more actionable
note, we must all learn from GitLab's experience. Almost everybody runs into
this kind of issue, but very few come out clean.

~~~
nicpottier
I think you are confused.

The best thing to happen to OSS is GitHub, not GitLab. GitLab is just a fast
follower and likely wouldn't even exist without the former.

I for one am happy to throw money GitHub's way for their role in so
dramatically changing how we code.

~~~
Thaxll
You code differently using Git vs. SVN? That's an interesting concept. It
doesn't change the way you code; it just changes the way we share code.

~~~
jasonwatkinspdx
Git makes multi-tasking on multiple branches far easier than SVN. In the SVN
days I always had multiple checkouts and a pile of scripts for
changing/updating to branches for handling a few regular tasks in that
context. Once git won I was able to delete them and have a far simpler
workflow.

------
neals
I don't get to work on databases this size, and today has been an incredible
lesson and a journey. I've been reading all the comments and blogs, watching
the stream, and Googling what I didn't know or understand.

I feel like the next step for me is scaling my business so that we have an
actual use for my newly found interests :)

~~~
sulam
Then you end up having deep expertise in a topic that's only important for
larger companies. Sometimes this works out well, sometimes it leaves you a
little stuck. :)

~~~
stusmall
Meh. Just learn, learn, learn. If it isn't completely applicable to your
life, that's okay. Sometimes there are pearls of wisdom in the best practices
of a field completely unrelated to your own. Learning something new is always
a good thing.

~~~
haggy
THIS ^^^ So much this. I can't tell you how many times I've picked up a book
or paper thinking "there's no way I'm going to get anything new out of this",
and as I start reading, I find little tidbits of information that make me
think in ways I didn't before.

------
intsunny
The GitLab situation and Uber's article speak to the immaturity of
PostgreSQL's native replication feature and, more importantly, to how poorly
documented, poorly adopted, and hard to google the replication strategies
are.

~~~
tshannon
I believe GitLab used Slony, not the native replication. I'm not well versed
in Postgres, but that's what I gleaned from reading their event log.

~~~
YorickPeterse
As mentioned below we only used Slony to upgrade from 9.2.something to 9.6.1.
For regular replication we use PostgreSQL's streaming replication.

------
wheelerwj
I wish everything was discussed/handled as publicly and transparently as this
whole scenario.

I really hope this becomes a thing.

~~~
cornedor
I totally agree. The live stream [1] is amazing: discussing steps in the open
like that, and answering questions from the YouTube chat if there is some
time left.

[1] [https://www.youtube.com/c/Gitlab/live](https://www.youtube.com/c/Gitlab/live)

~~~
jobvandervoort
Glad to hear this was appreciated. It was an experiment, one we hope to
repeat only in other scenarios.

~~~
throwaway7767
As a sysadmin, I'd find it incredibly distracting to be on a livestream while
trying to fix a critical issue. For your employees' sake, I hope you don't do
this again.

Have a single point of contact who provides information about the recovery
process. Being transparent and providing technical info is good, but that
task should not be handled directly by the admins at the same time they are
focusing on the drop-everything-shit-is-broke emergency.

------
hyperpape
I'm surprised by the statement that 4 gigs of replication lag is normal.
However, I don't manage backups for anything larger than personal pet
projects, so I don't have a sense of scale.

~~~
aexaey
> don't have a sense of scale

From [1], the complete db is ~300GB, and from some iffy pixel measurement of
the graph at the very bottom of that page, copying speed between otherwise
idle db hosts was about 22.8 GB/hour (in-production replication is probably
slower than that).

From that, 4GB of replication lag would represent 1.3% of the db by size, or
10+ minutes of lag (as measured by the time required to catch up under ideal
circumstances).

[1] [https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/](https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/)
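
A quick back-of-the-envelope check of those figures (a sketch; the inputs
are the eyeballed estimates above, not measured values):

```python
# Rough check of the figures above.
db_size_gb = 300        # total database size, from the GitLab post
copy_rate_gb_h = 22.8   # copy speed between idle hosts, from the graph
lag_gb = 4              # the replication lag in question

print(f"{lag_gb / db_size_gb:.1%} of the db")                 # -> 1.3%
print(f"{lag_gb / copy_rate_gb_h * 60:.0f} min to catch up")  # -> ~11 min
```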

~~~
hyperpape
I didn't think to eyeball the graph to estimate how much time the 4GB
translated to, so thanks.

However, "scale" was the wrong word for what I was wondering about. My
question should've been whether ~1% of your total DB, or 10 minutes of
replication lag, is reasonable/nothing to worry about, as the article
suggested.

~~~
joking
It was an issue; that's part of the reason a tired person was working to
reduce it.

------
marricks
From their blog post,

> So in other words, out of 5 backup/replication techniques deployed none are
> working reliably or set up in the first place. We ended up restoring a 6
> hours old backup.

That must be _terrifying_ to realize. I mean, thank goodness they had a
6-hour-old backup or they'd be in such an awful spot.

~~~
cptskippy
I would counter that they're still in an awful spot, because this
announcement reeks of incompetence and isn't something you want to hear from
the people you're entrusting with keeping your code safe.

It would be like Boeing or Airbus announcing that all the safety features on
their airliners were non-functioning.

~~~
testUser69
The fact that they were upfront and honest about it, and even live streamed
themselves fixing the problems, makes me want to use GitLab even more. If
anything, I have even more confidence in them. You didn't hear a peep from
Microsoft when the forced Windows 10 upgrade bricked thousands of laptops.
Perhaps that's why such a huge portion of developers prefer OSX/Linux to
Windows? I've run six businesses over the past decade, and when Windows 10
started rolling out I was in a Houston office that lost nearly a hundred
terabytes of client data. We did everything by the book: paid for the
business and enterprise editions of Windows, used their servers, used their
proprietary software stack, used their support service, and they still fucked
us and didn't really seem to care or think they did anything wrong.

I have another company that runs on a completely open stack, where pretty much
nothing is integrated by a specific vendor. We have hiccups, but we've never
had the OS get hijacked and upgraded.

I've noticed most start-ups run by devs run on a more open stack and hack
their way through problems on the cheap, and the ones run by corporate
executives try to keep things as closed as possible, but end up spending
millions to solve problems that they could have had some people solve for fun
on the internet.

I prefer to use the right tool for the right job, but I wish companies like
Microsoft would be more open when they cause huge issues that end up causing
monetary loss. I make sure all my critical infrastructure is open source these
days.

~~~
FireBeyond
"You didn't hear a peep from Microsoft when the forced windows 10 upgrade
bricked thousands of laptops. Perhaps that's why such a huge portion of
developers prefer OSX/Linux to Windows? "

Weird that you group OSX in there. The only company in mainstream tech more
secretive than Microsoft when it comes to problems is... Apple.

------
jwilk
[https://news.ycombinator.com/item?id=13537052](https://news.ycombinator.com/item?id=13537052)

~~~
willemmali
^ Related HN discussion on the live report @ Google Docs

------
pzh
Does nobody else find the report cringeworthy? Apparently, there are some
junior engineers fumbling around and committing serious errors, but where are
the senior ones, and the processes/failsafes to prevent all this?

~~~
jjirsa
Yes. It makes me angry. It's not that backups failed; it's more like broad
incompetence in all the wrong places.

------
lsh123
Many years ago, in '96 or '97, a colleague of mine tried to upgrade an Oracle
DB by running "sudo /path/install.sh" from the root folder. Little did he
know that the script did "rm -rf *" on one of its first lines :)

The day was saved by the fact that Oracle stored data on a block device
directly. There was no data loss, and we just had to restore the machine
itself.

Since that day, I never run any scripts in /, /etc/, ...

~~~
dredmorbius
I've always despised the practice of vendor-provided installation scripts for
much this reason.

Use local installers. Or tarballs.

------
bfrog
Testing backups, or at least monitoring them for correctness, is a huge deal,
and it's a problem I myself fudged up on one occasion, right around the
holiday season, which was terrible.

I've since set up wal-e with daily base backups, deleting anything older than
a week, along with nightly pg_dumps and a hot standby. Maybe that's overkill,
but after having lost data once: never again!

The nice part about doing WAL archiving the way barman or wal-e do is that
you can do more than just backup/restore; you can restore with some time
target in mind as well.

Did someone somehow run a massive update or delete, or insert millions of
rows of garbage? No worries: stop, destroy, restore to a previous point in
time, and continue onwards.

Did a bug in Postgres, the kernel, the filesystem, or any other multi-million
line codebase in that stack screw something up? Most likely the WAL segments
are still good up to a point.

Hot standby gets you potentially sub-minute failovers if you automate them, or
short enough to be ok even with manual failovers. WAL archiving gets you
another whole level of safety net that is hard to beat.
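
The moving parts for this are small. A sketch of the nightly side of such a
setup (not GitLab's actual configuration; the data directory path is
hypothetical, and wal-e is assumed to be configured through its usual
environment variables such as WALE_S3_PREFIX):

```python
#!/usr/bin/env python3
"""Nightly wal-e routine: daily base backup plus a one-week window.

PostgreSQL ships WAL segments continuously on its own via
    archive_command = 'wal-e wal-push %p'
in postgresql.conf; this script only handles the base backups.
"""
import subprocess

PGDATA = "/var/lib/postgresql/9.6/main"  # hypothetical data directory

# Push a fresh base backup. Combined with the archived WAL, this allows
# point-in-time recovery to any moment after the backup completed.
subprocess.run(["wal-e", "backup-push", PGDATA], check=True)

# Keep the 7 newest base backups; WAL segments older than the oldest
# retained base backup are pruned along with them.
subprocess.run(["wal-e", "delete", "--confirm", "retain", "7"], check=True)
```

The restore side of the "rewind to just before the bad delete" scenario is
`wal-e backup-fetch` plus a recovery.conf containing `restore_command =
'wal-e wal-fetch "%f" "%p"'` and a `recovery_target_time`.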

~~~
sgs1370
One thing I could never figure out (probably because I didn't study the docs
enough) is when I could safely delete the archive files. What is the best
reference on this? Or maybe I was trying to reinvent the wheel: I didn't
really look at 3rd-party tools, only PostgreSQL's built-in options (plus my
glue on top). Is the only right way to use the archive (file shipping)
approach to combine it with a 3rd-party tool?

------
giancarlostoro
On the other hand, even if there was some data loss, shouldn't most people
have their entire repositories on their drives (at least the ones they're
actively working on), so that in theory much can be recovered by end-users
who are active? The only true worry is inactive users. Not sure if this was
discussed much.

~~~
yoavm
AFAIK the problem was with the database and not with the repos. So yeah, I
have all my files on my machines, but I don't have any copy of the issues,
merge requests, wiki pages, etc.

------
esseti
Aside from the incident, this is a great opportunity to learn something new
(at least for me). That said, does anyone know how they plot all the charts
for Postgres and co., such as this one:
[http://monitor.gitlab.net/dashboard/db/postgres-queries](http://monitor.gitlab.net/dashboard/db/postgres-queries)
I know the chart is made in Grafana, but how do they collect the data?

~~~
sytse
We use Prometheus (prometheus.io). We have a team of Prometheus engineers and
a vacancy for more.

------
aidenn0
We have a saying at my work:

"If it's not tested it doesn't work"

While this was originally about software, it's amazing how many other places
it applies. Do you require code reviews before commit? Periodically sample a
few random commits and see whether a review was done for each.
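
That kind of spot check is a few lines of scripting. A hypothetical sketch,
assuming reviewed commits carry a "Reviewed-by:" trailer (substitute whatever
marker or code-review API your team actually uses):

```python
#!/usr/bin/env python3
"""Spot-check: do randomly sampled commits show evidence of review?"""
import random
import subprocess

# All commit hashes reachable from master.
hashes = subprocess.check_output(
    ["git", "rev-list", "master"], text=True).split()

# Sample a handful of commits at random...
for sha in random.sample(hashes, k=min(5, len(hashes))):
    message = subprocess.check_output(
        ["git", "show", "-s", "--format=%B", sha], text=True)
    # ...and flag any commit whose message lacks the review trailer.
    print(sha[:10], "ok" if "Reviewed-by:" in message else "MISSING REVIEW")
```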

Heck, if you don't test your system for running automated tests, it may be
that you aren't even testing what you think you're testing.

------
carmate383
How can a "company" with more than 150 employees (which is, essentially, a
data storage company) let this happen?

GitLab is surely losing subscribers faster than land in Crimea after this...

~~~
cs02rm0
Incident upon incident, followed by a mistake under pressure late at night.

It happens. I suspect the guy responsible for the final straw is feeling
pretty bad. I know I've come close to doing similar things on production
environments I really didn't want to be touching while they were falling
apart.

But they've been honest about it. If they learn from it, and six hours of
database data is the worst data loss they ever experience, I think it'll be a
credit to them that they've been promptly transparent.

------
jbverschoor
Only pull requests and issues are gone. And even if the repos were gone,
wasn't git supposed to be a distributed VCS anyway? ;)

------
slowhand09
TL;DR: I'm sure it's been said already, but doing backups is great; testing
RECOVERY is critical and should be top priority. From a data company, this is
scary. When you must back up your data because your data company can't be
trusted to back up your data...

------
da4c30ff
Git really should have issue and merge/pull request data shipped with the
repository. Does anyone know if this has been planned or not?

~~~
Ajedi32
[https://gitlab.com/gitlab-org/gitlab-ce/issues/14924](https://gitlab.com/gitlab-org/gitlab-ce/issues/14924)

------
chaosfox
In related news, gitlab.com seems to be back up now.

------
thisisadumb
It's a good thing our data is always backed up by your friends at the NSA.

~~~
PhantomGremlin
They'll back it up for you, but they won't give you access to your own data:
[https://memegenerator.net/instance/68192872](https://memegenerator.net/instance/68192872)

------
overcast
So the DeathStar is operational once again?

~~~
cabargas
it is a fully operational battle station once again :)

------
pdog
It feels like PostgreSQL is at the center of every terrible story about data
loss [1] or poor performance [2].

I think companies prefer other databases like MySQL because they _"just
work."_

[1]: [https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/](https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/)

[2]: [https://eng.uber.com/mysql-migration/](https://eng.uber.com/mysql-migration/)

~~~
sgarman
Seems like confirmation bias. I always think of NoSQL databases when I'm
thinking about terrible data-loss stories; I especially remember the CouchDB
one. Postgres has been nothing but amazing for my uses.

~~~
patmcguire
If you delete your data directory, you're going to have a bad time no matter
what your stack is.

~~~
VintageCool
Yes, it's probably an outage event, but then you flip over to your slave MySQL
instance and keep going on your merry way.

