
GitLab.com Downtime Postmortem - jobvandervoort
https://docs.google.com/document/d/1ScqXAdb6BjhsDzCo3qdPYbt1uULzgZqPO8zHeHHarS0/edit?usp=sharing
======
wpietri
The bit about the large repositories reminds me tangentially of something I
read about Amazon: when tuning for performance, they don't look at, say, the
median response time. They look at the 99.9th percentile.

If I recall rightly, their example was a customer searching their old orders.
The customers with the slowest response times were their very best customers,
and Amazon wanted those people to have at least as good an experience as the
median customer.

That definitely changed my attitude to performance tests: now whenever I'm
picking metrics, I think hard about the real-world implications of the levels
I'm setting.
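
Roughly, in code, the effect looks like this (a toy sketch with made-up
numbers, nothing from the postmortem): a handful of slow requests from the
biggest accounts barely moves the median but completely dominates the 99.9th
percentile.

    import statistics

    # 10,000 simulated response times in ms: most requests are fast, but the
    # 30 requests from the largest accounts are painfully slow.
    samples = sorted([120] * 9970 + [4500] * 30)

    median = statistics.median(samples)
    p999 = samples[int(len(samples) * 0.999) - 1]

    print(f"median: {median} ms")  # 120.0 ms -- looks healthy
    print(f"p99.9:  {p999} ms")    # 4500 ms -- what your best customers see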

~~~
sytse
Very true, and you can be sure we'll pay a bit more attention to the 0.1%
biggest repositories from now on.

------
felixgallo
One important learning from the postmortem: always set your servers up to be
in UTC rather than any other time zone. Helps debugging and log correlation
and eliminates confusion during incidents.

~~~
wpietri
Is that the lesson? Or just that everything should be in the same time zone?

If the company's staff is all in one time zone, I'm inclined to use that for
servers, as otherwise people have to mentally juggle two time zones: local and
UTC.

~~~
felixgallo
That's a good question. UTC is almost certainly still the best candidate even
in that case because:

* it's not affected by daylight saving time, which in many candidate timezones causes at least two complex and error-fraught time discontinuities every year, and which in the USA requires manual intervention at the whims of Congress.

* UTC is the lingua franca of machine time communication -- if you ever have to, e.g., send data about transaction times, server event logs, order histories, etc., to any third party or API, then because there are 23 other timezones besides yours, and data normalization is important, they will almost certainly expect ISO 8601 UTC (see the sketch after this list). Yes, ISO 8601 has time offsets. No, not all libraries implement them properly on either side.

* The first time you ever have logs from two different timezones -- e.g., a service provider log denominated in UTC and yours in PST -- you will hate computers so much that you will likely quit your job, throw your keys on the floor, and go run a hotdog stand down by the pier rather than deal with it. This is an honest response to time and localization issues but think of your children.
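
For the ISO 8601 point above, here's a minimal Python sketch (nothing
GitLab-specific, just an assumed logging helper) of stamping events in UTC;
the naive local-time call is exactly where the DST discontinuities bite:

    from datetime import datetime, timezone

    def utc_timestamp() -> str:
        # datetime.now() with no argument uses the server's local zone and
        # jumps at every DST transition; asking for UTC explicitly gives
        # every consumer the same unambiguous +00:00 offset.
        return datetime.now(timezone.utc).isoformat()

    print(utc_timestamp())  # e.g. 2014-07-09T14:23:05.123456+00:00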

~~~
supine
...and just when you thought it was safe to go back outside, leap seconds rear
their ugly head and ruin your day.

------
sytse
GitLab B.V. CEO here, please let us know if you have any questions (you can
also leave a comment or suggestion in the doc if you want). This whole
real-time postmortem is something we thought of as a way to contribute back
after having downtime. Feel free to let us know what you think of it.

~~~
LeonM
It was only after I saw the "B.V." that I realized GitLab is a Dutch company,
cool! :)

Just one question: I saw a lot of logs being copy-pasted. Aren't you concerned
about security-sensitive data leaking out?

~~~
sytse
Thanks! We are based in the Netherlands but we're mostly a remote company:
https://about.gitlab.com/2014/07/03/how-gitlab-works-remotely/

We're worried about sensitive data; see my other answer:
https://news.ycombinator.com/item?id=8003800

------
sciurus
"6\. Production filesystem IS NOT mounted at this point

9\. gitlab-ctl start starts the GitLab with the staging data in
/var/opt/gitlab on the root filesystem that doesn’t have any production data.
At this point logs report that the production db doesn’t exist which is
correct because we are not on the production file system. No production data
has been touched at this point. The Gitlab web UI is not responding (502 error
from nginx)"

This is what strikes me as needing to be addressed. There shouldn't be staging
data that is normally hidden by mounting the production filesystem. If the
production database isn't there, Postgres should fail to start. The Postgres
team is pretty adamant that otherwise bad things can happen; see
http://www.postgresql.org/message-id/12168.1312921709@sss.pgh.pa.us

~~~
sytse
We fixed part of the root issue by restarting GitLab automatically when we
configure a server as master, preventing it from continuing with the staging
data. The staging data was created automatically by gitlab-ctl. I agree that
this is confusing and created the item: "Why did GitLab start when it
shouldn't?" in the postmortem. Thanks for raising this point.

------
pilif
I wonder what their DRBD is using as its backing store and why it's needed at
all.

Personally, I honestly wouldn't trust running a database on top of DRBD, and
I'm unsure whether using a DRBD volume as a database directory will even leave
the other replica with correct and usable data in case of an error.

Instead, I would use a real disk or maybe LVM as the store for the database in
order to reduce the number of points of failure and remove less-tested
components from the picture.

Then of course you need a database slave, but it looks like you had that
anyway, which again leads me to question why DRBD was even involved.

~~~
sytse
We use EXT4 on DRBD on LVM on RAID10. We already have DRBD for the git repos
and prefer to reuse our experience for the database. There are other reasons
too, some of which are articulated in
http://wiki.postgresql.org/images/0/07/Ha_postgres.pdf

~~~
sytse
I gave a presentation yesterday about our HA setup:
https://docs.google.com/presentation/d/1gIzmp-d5X86jJMQz7Ixs_dzyhC1OTVK0w3uV_Rtc_Us/edit#slide=id.p

------
fsiefken
Could this be caused by DRBD not reading the changes from the PostgreSQL
database quickly enough? Or is it some issue with the interaction between the
software RAID10 and DRBD under high PostgreSQL I/O load? What are the
respective versions of the kernel, database and DRBD? Are there disk I/O,
network I/O and CPU logs available from the time leading up to the crash?

~~~
sytse
We're not sure. There was a lot of disk locking going on at the time.

------
jbk
Well, this is not back for us: our repositories are corrupted...

~~~
sytse
I'm sorry to hear you are experiencing problems. This is the first report
we've seen of possible corruption. Please contact support@gitlab.com and note
the URLs and commands that do not work for you. Edit: there's another report I
just found: https://gitlab.com/gitlab-com/support-forum/issues/2

------
sytse
We're taking a break until 16:00 CEST; many questions have been answered
already and TODOs are given after the => in the document.

~~~
sytse
We're back again.

~~~
sytse
And we're done for today, thanks for stopping by everyone.

