
GitLab is down. Again. - openbasic
https://twitter.com/gitlabstatus/status/989441724836667392
======
foobarbazetc
A bunch of thoughts from a guy running postgres on a large site:

A restore rate of 100Gb/hour is way too low.

Over a 10GigE interface with a semi-decent SSD RAID you should be pushing
1Gb/second _easily_.
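
For scale, treating both figures as gigabytes (an assumption on my part), a quick back-of-envelope comparison in Python:

    # Back-of-envelope only; every number here is an assumption, not a measurement.
    restore_gb_per_hour = 100
    restore_mb_per_s = restore_gb_per_hour * 1024 / 3600   # ~28 MB/s observed restore rate
    ten_gige_mb_per_s = 10_000 / 8                          # ~1250 MB/s 10GigE line rate
    ssd_raid_mb_per_s = 1_000                               # assumed sustained sequential write

    print(f"restore rate: {restore_mb_per_s:.0f} MB/s")
    print(f"10GigE link:  {ten_gige_mb_per_s:.0f} MB/s")
    print(f"SSD RAID:     ~{ssd_raid_mb_per_s} MB/s (assumed)")

Under those assumptions the restore is running at a few percent of what either the network or the disks could take, so the bottleneck is almost certainly elsewhere.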

Also, pgbouncer is never the problem. If it’s complaining, it’s either
configured wrong or the database is having a bad time.
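
A quick way to tell which of the two it is is to ask pgbouncer's admin console directly. A rough sketch, assuming psycopg2 and a hypothetical stats user that is allowed to connect to the special "pgbouncer" admin database:

    # Sketch: inspect pgbouncer pools via its admin console.
    # Host, port, and user are placeholders for illustration only.
    import psycopg2

    conn = psycopg2.connect(host="127.0.0.1", port=6432,
                            dbname="pgbouncer", user="stats_user")
    conn.autocommit = True          # the admin console doesn't speak transactions
    cur = conn.cursor()
    cur.execute("SHOW POOLS")
    cols = [d[0] for d in cur.description]
    for row in cur.fetchall():
        pool = dict(zip(cols, row))
        # cl_waiting > 0 means clients are queued: either the pool is sized
        # too small (config) or the backend is too slow to hand connections
        # back -- the two cases above.
        print(pool["database"], "waiting:", pool["cl_waiting"],
              "active:", pool["sv_active"])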

I’d be scared out of my mind if this were my service and I was running it all
on a single master with no usable backups. Then making it worse by letting
production traffic hit it, thereby moving the database further and further from
when the issue occurred.

Those snapshots are going to be unusable unless the underlying FS is frozen
before the snapshot gets taken.
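
In practice that ordering looks roughly like the sketch below. The mount point, volume names, and the LVM snapshot command are placeholders, and fsfreeze needs root:

    # Sketch of the freeze -> snapshot -> thaw ordering described above.
    import subprocess

    MOUNT_POINT = "/var/lib/postgresql"   # hypothetical data volume

    subprocess.run(["fsfreeze", "--freeze", MOUNT_POINT], check=True)
    try:
        # Replace with the real snapshot call (LVM, EBS, Azure disk, ...).
        subprocess.run(["lvcreate", "--snapshot", "--size", "10G",
                        "--name", "pgdata-snap", "/dev/vg0/pgdata"], check=True)
    finally:
        # Always thaw, even if the snapshot fails, or writes stay blocked.
        subprocess.run(["fsfreeze", "--unfreeze", MOUNT_POINT], check=True)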

Honestly reading the notes (and not knowing anything about the team) they need
less DevOps and more SysAdmin/DBA.

~~~
jfroma
> Honestly reading the notes (and not knowing anything about the team) they
> need less DevOps and more SysAdmin/DBA.

Having had problems in the past with Postgres and managing their database [1],
I wonder why they haven't switched to a Postgres-as-a-service solution like
Google Cloud SQL or Amazon RDS. It seems they are even running the site on
their own hardware [2].

[1]: https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

[2]: https://about.gitlab.com/2016/12/11/proposed-server-purchase-for-gitlab-com/

~~~
edjboston
GitLab VPE here. We apologize for the site performance degradation today.

We do have a dedicated DB team of three people. We also have open vacancies
for the team's manager and an individual contributor.

We are in the midst of a move from Azure to GCP. That means more work than
usual is going on currently, because we're replicating data across
infrastructure-as-a-service providers while keeping the site running. This has
been the case for a couple of months now and it shouldn't impact site
performance. Today's site slowness was unfortunately due to a manual mistake
related to a Postgres upgrade task.

[2] This article is old. A move to metal never happened, and is not in our
future plans. In fact, after we move to GCP we are likely to set up a Cloud SQL
replica and evaluate its performance.

------
reacharavindh
Serious question: would it be better if we had services that did not "move
fast and break things", particularly when dealing with user-facing components
like this?

Imagine a GitLab product that goes into operation and gets updated with a
good load of features every 3 or 6 months, where each release is thoroughly
tested internally before it replaces the one in production?

~~~
edjboston
GitLab VPE here.

Historically, the dominant source of outage minutes has indeed been features
that didn't scale (70%).

However, we've made great strides in the past 6 months on QA and release
management, and it's yielded a marked improvement in availability. The last
week has been an exception to that.

We're in the midst of a move from Azure to GCP, and once that is done we're
going to rebuild our system to be entirely automated, which will eliminate a
class of manual mistakes.

