
Scaling the GitLab database - fanf2
https://about.gitlab.com/2017/10/02/scaling-the-gitlab-database/
======
vincentdm
Funny thing is that they publicly own up to their performance problems in an
unusual way: in their comparison with GitHub
([https://about.gitlab.com/comparison/](https://about.gitlab.com/comparison/))
they list "Fast page load" as a feature that GitLab lacks and GitHub has.

Nevertheless, the slowness is really annoying, especially because their
product is so good on all other counts. If scaling their database can help
speed things up, I bet they will be glad to remove this embarrassing "missing
feature".

In marketing terms, having fast page loads would be called a "qualifier". For
example: you expect a hotel to provide toilet paper. You won't pick a hotel
because of it, but you will surely avoid one that doesn't.

~~~
sytse
The good news is that page loads are much better now. Our initial page load
ping
[http://stats.pingdom.com/81vpf8jyr1h9/1902794](http://stats.pingdom.com/81vpf8jyr1h9/1902794)
is better than GitHub.com
[http://stats.pingdom.com/81vpf8jyr1h9/1902795](http://stats.pingdom.com/81vpf8jyr1h9/1902795)

We've got work to do on the 99th percentile and on merge request page loads,
but the overall situation has improved dramatically.

We've still got work to do on availability, so I changed the 'feature' to reflect
this [https://gitlab.com/gitlab-com/www-gitlab-
com/commit/3b3bddf5...](https://gitlab.com/gitlab-com/www-gitlab-
com/commit/3b3bddf51e47f51e97a03fbb1a1ae8f40fdfd279)

~~~
pubgcrawler
Maybe I just can't find it, but which URLs is pingdom.com checking? Do both
issues have the same content? Or does one have 50 comments and the other 1?

~~~
sytse
These are checking
[https://github.com/gitlabhq/gitlabhq/issues/1](https://github.com/gitlabhq/gitlabhq/issues/1)
and [https://gitlab.com/gitlab-org/gitlab-
ce/issues/1](https://gitlab.com/gitlab-org/gitlab-ce/issues/1)

The one on GitLab is considerably longer.

------
twojustin
My team is currently using the hosted Citus Cloud, which uses PgBouncer. They
note that the reason for sharding is high write volume. We've actually seen
big benefits from moving to a sharded setup via Citus just as much for the
read performance.

By sharding by customer we're able to leverage Postgres caching more
effectively and scale elastically. Since switching over, our database has
performed and scaled much better than when we were on a single node.

There is some up-front migration work, but that's limited to some minor
patching of the ActiveRecord ORM and changes to migration scripts. We
considered Cassandra, Elasticsearch, and DynamoDB as well, but the amount of
migration changes in the application logic would have taken an unacceptable
amount of time.

~~~
craigkerstiens
Craig from Citus here. Thanks for the kind words, Justin. As you mention,
there is some upfront work, but once it's in place it becomes fairly manageable.

We've been working to make that upfront work easier as well, with libraries
that allow things to be more drop-in: activerecord-multi-tenant
([https://github.com/citusdata/activerecord-multi-tenant](https://github.com/citusdata/activerecord-multi-tenant))
and django-multitenant
([https://github.com/citusdata/django-multitenant](https://github.com/citusdata/django-multitenant)).
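
For anyone curious, here's roughly what the Rails side of that looks like with
activerecord-multi-tenant. This is a minimal sketch from my memory of the
README, so treat the exact API as something to verify:

    # Gemfile (assumed): gem 'activerecord-multi-tenant'
    require 'activerecord-multi-tenant'

    class Issue < ActiveRecord::Base
      # Declares the tenant column; Citus uses it as the distribution key.
      multi_tenant :customer
    end

    # Queries inside the block are scoped to this tenant, so Citus can route
    # them to a single shard instead of fanning out across the cluster.
    MultiTenant.with(customer) do
      Issue.where(state: 'open').count
    end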

~~~
contrahax
Praying for a [http://docs.sequelizejs.com/](http://docs.sequelizejs.com/)
plugin

------
RandomBK
This article is also very useful in showing just how far you can push Postgres
_without_ reaching for any of these optimizations. I've seen too many projects
worry about these things very early on in their lifecycle, when in reality
they are nowhere close to having enough traffic to cause a problem.

~~~
nickh9000
The whole point of worrying from the beginning is that you don't have to
rework everything later, which puts the business at huge risk when you're
forced to do it.

~~~
jsmeaton
Instead you appear to be advocating over-engineering a solution to a problem
you don't yet have, where the opportunity cost is probably features customers
could use. That is also a huge risk to a business, but one that is
_immediate_.

~~~
nickh9000
Thinking about how you would shard your data is hardly over-engineering. On
some storage systems you are forced to anyway: Cassandra, DynamoDB.

And the cost of building the solution from the outset is hardly a killer. The
pain usually comes from operational complexity.

------
nimbius
For anyone new to GitLab, it is a bear. Even their own cluster/HA
documentation is mostly just a thin roadmap to a final goal; a statement of
intent, if you will. There are just far too many moving parts in this
software, and the glaring problem of "Ruby doesn't scale well" is
ever-present. Packaged releases depend entirely on a bundled Chef installation
to get everything out the door and running.

The software tries to do too much: CI, wiki, code review, git, container
registry, analytics package.

Sharding CI per team and carving GitLab up into departments and groups has
been my only solution so far, and it's just distributed risk at that. It's the
difference between the Unix ethos of do one thing and do it well, versus the
F-35 approach of do everything forever.

~~~
YorickPeterse

> and the glaring problem of "Ruby doesn't scale well" is ever-present

GitLab has many performance problems (and many have been solved over the
years), but Ruby has thus far not been one of them.

~~~
GrayShade
I remember installing GitLab on a small NAS-type box. It was a while ago, but
on each start-up it ran a Node.js tool to pre-compile some assets, I suppose.
On that machine it took 10-15 minutes to start. Afterwards, the unicorn
workers kept getting killed because the default memory limit (128 MB) wasn't
enough to process more than literally a couple of requests. It did work, but
pages took 2-5 seconds to load, which I couldn't stand.

For anyone needing a small Git Web UI, I suggest [https://github.com/go-
gitea/gitea](https://github.com/go-gitea/gitea).

EDIT: Oh, you're that guy ^^;.

Anyway, my rant was probably a bit harsh. I know that GitLab isn't really
written for hardware comparable to a Raspberry Pi and that it's used by small
and large teams all over the world. But on my hardware Gitea renders pages in
20-100 ms, which is orders of magnitude faster than GitLab. I'm sure part of
this is the fault of the language.

~~~
nisa
So just use cgit on your 128 MB box and don't complain about GitLab? It's like
installing Windows 10 on a machine with 1 GB of memory, starting Photoshop,
and complaining about swapping and how slow it runs.

GitLab runs fine out of the box with 4 GB of memory, and with some tweaking of
the provided knobs it's fine with 2 GB or less.

It's in the docs, it's in every FAQ, every Stack Overflow answer... so why the
hate?

64 MB of memory? Use gitolite. 256 MB? Use cgit + gitolite. 1 GB? Use Gitea or
whatever else there is. 4 GB? Use GitLab if you want.

I'm not affiliated, but I can't stand the "GitLab doesn't run on my home wifi
router - it sucks" posts here... it works pretty well if you read and
acknowledge the documentation and requirements.

~~~
the_duke
I'm running a Gitlab instance for a smallish dev team of 10.

I haven't gotten it to run reliably with any less than 8 GB, even though we
don't have a whole lot of activity there.

It takes surprisingly long to start up too, even on a relatively beefy host.

I do like the product though; the all-in-one solution with code hosting,
issues, code review (needs work though...) and CI is great. (Not using the
deployment/orchestration stuff.)

My biggest issue is that they seem to be too focused on pumping out new
features.

There are 9k open issues just for CE. That's probably 12k+ across all products
(runner, EE, ...).

That is either a sign of very bad project management or of way too big an
agenda. Probably both.

~~~
sytse
I'm surprised to hear GitLab isn't running reliably with 8GB. Are you running
the Omnibus installation?

The open issues for CE [https://gitlab.com/gitlab-org/gitlab-
ce/issues](https://gitlab.com/gitlab-org/gitlab-ce/issues) do include feature
requests.

~~~
the_duke
I'm using the official Docker image.

It's been very painless to set up and maintain, and upgrades have mostly been
painless as well so far.

With 8GB for the Docker container it's solid and pretty stable, but with any
less, behaviour gets very unstable (timeouts, very long request load times,
...).

Of course those 8GB are for the whole stack (Postgres, Redis, Nginx, Unicorn
instances, ...), so I'll admit that 8GB for the _Docker container_ is quite
important context here. ;)

~~~
sytse
Thanks for the answer. I would have thought that 4GB is enough. Thanks for
providing a data point that it might not be.

------
tschellenbach
Running PgBouncer is a very basic optimization; you typically start out with
PgBouncer in your stack if you have experience running Postgres.

If you're new to running your own Postgres databases, you should also check out
WAL-E: [https://github.com/wal-e/wal-e](https://github.com/wal-e/wal-e) and
the awesome pg_stat_statements:
[https://www.postgresql.org/docs/10/static/pgstatstatements.h...](https://www.postgresql.org/docs/10/static/pgstatstatements.html)
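
If you haven't used pg_stat_statements before, here's a quick sketch of what
it gives you, run from a Rails console. This assumes the extension is loaded
via shared_preload_libraries and created with CREATE EXTENSION:

    # Top 10 queries by total execution time (total_time is in milliseconds).
    rows = ActiveRecord::Base.connection.select_all(<<-SQL)
      SELECT query, calls, total_time, total_time / calls AS avg_ms
      FROM pg_stat_statements
      ORDER BY total_time DESC
      LIMIT 10
    SQL
    rows.each { |r| puts "#{r['avg_ms'].to_f.round(2)} ms avg: #{r['query'][0, 60]}" }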

~~~
craigkerstiens
If you are already familiar with wal-e, or even if not, you might want to
consider taking a look at wal-g [1]. wal-g is a newer take on wal-e, written
in Go, that we've seen deliver up to 7x performance improvements [2].

[1] [https://github.com/wal-g/wal-g](https://github.com/wal-g/wal-g)

[2] [https://www.citusdata.com/blog/2017/08/18/introducing-
wal-g-...](https://www.citusdata.com/blog/2017/08/18/introducing-wal-g-faster-
restores-for-postgres/)

~~~
djd20
While wal-g looks great, having evaluated it a bit, I worry that nobody seems
to be maintaining it, based on the issues/PRs I looked at.

~~~
djd20
Scratch that - I see fresh commits - will give it another spin!

------
karmakaze
Other options available:

- CockroachDB 1.1 [1]

- AWS RDS Aurora with PostgreSQL compatibility [2]

[1]
[https://news.ycombinator.com/item?id=15458900](https://news.ycombinator.com/item?id=15458900)
[2]
[https://news.ycombinator.com/item?id=13072861](https://news.ycombinator.com/item?id=13072861)

~~~
elvinyung
CockroachDB is a really cool product, but I think by all accounts it is still
nowhere near performant enough to be used in production.

~~~
forgot-my-pw
An alternative is Google Spanner. Also still new, but it's been GA for 5
months now.

~~~
qaq
A Spanner setup that can match PG on a high-end x86 cluster will run way over
$100K/month.

------
maktouch
We're moving away from GitLab because the performance problems are too much to
bear. We hear "aaaaaarg, I really hate GitLab" a few times a day.

The pipeline page takes 7 seconds to load. It's really bad. We stopped using
issues because just listing them was a chore. Pushing one file takes at least
30 seconds to a minute.

Anyway, the work falls on me to replace it. We're going with Phabricator and
Jenkins.

~~~
josegonzalez
Is that self-hosted Gitlab or their hosted version? Would you mind sharing
details on your setup? Just curious as we're currently building out our own
setup.

~~~
maktouch
Right now:

- Self-hosted GitLab CE

- GitLab runners on preemptible nodes with autoscaling

Future (almost done, still in the testing phase):

- Phabricator (for git, code review and issues)

- Jenkins ("lightweight" CI, more details at the bottom)

- Google Cloud Container Builder ("heavyweight" CI)

Our main repository is in Phabricator, but it is mirrored to Google Cloud
Source Repositories (for backup and faster heavyweight builds).

On commits, we run the lightweight CI in Jenkins. What I mean by lightweight
is that there are only a few runners, and what they do is check what has
changed. We have a mono-repository with over 30 microservices; rebuilding all
of them is a waste of time and resources, so we have scripts and easy-to-use
manifests that define what to do. We also have a lot of small CI checks that
are just not worth running externally, like linting.

Here's what our homegrown manifests look like:

    
    
      build_graphql:
        when_changed:                    # rebuild only when one of these paths changes
          - graphql
          - node_libraries/player-graphdb
          - node_libraries/player-auth
          - node_libraries/player-models
        script_build: graphql/cloudbuild.yaml       # full build + test + push
        script_skip: graphql/cloudbuild-skip.yaml   # retag the last successful build instead
    

The skip script takes the last successful build and retags the Docker images to
the current one. The build script sends the instructions to Google Cloud
Container Builder to build the Docker image, run the tests, and push it. We
use Cloud Container Builder because managing those CI servers really sucks.

Let me know what else you want to know.

------
ericfrederich
Sounds like a design flaw... adding a centralized database to augment a
decentralized version control system.

Comments, issues, pull requests, wiki, etc. should all be first-class Git
objects. I should be able to create pull requests while offline and push them
up when I'm back online.

~~~
stephenr
Issues can live in a DVCS with something like Bugs Everywhere.

Git _has_ native "pull requests"; they're just not web-based, they're email-based.

A good wiki is markdown docs in a git/hg repo and something like Gollum to
serve web viewers.

So - in a way those things can all be handled as regular content in git/hg.

------
vthriller
> A side effect of using transaction pooling is that you cannot use prepared
> statements, as the PREPARE and EXECUTE commands may end up running in
> different connections.

There is a way to use prepared statements: it's called preprepare [0]. You
still need to tweak your code to not execute PREPARE directly but rather put
the statements in a special table -- but you only do that once, and after that
you EXECUTE like before. Another minor (in my view) caveat is that you need to
reset the pgbouncer↔postgres connections each time you update that statements
table.

[0]
[https://github.com/dimitri/preprepare](https://github.com/dimitri/preprepare)
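
Roughly how the pieces fit together, as I understand the README -- the
statements table and its columns here are my assumptions, so check the project
docs before copying this:

    # 1. Register the statement once in the table preprepare reads from
    #    (name/schema assumed here; it's configurable on the preprepare side):
    ActiveRecord::Base.connection.execute(<<-SQL)
      INSERT INTO preprepare.statements(name, statement)
      VALUES ('issue_by_id', 'SELECT * FROM issues WHERE id = $1')
    SQL

    # 2. Have pgbouncer replay the registered statements on every fresh
    #    server connection, in pgbouncer.ini:
    #      connect_query = SELECT prepare_all();

    # 3. From then on, plain EXECUTE works through the transaction pool:
    ActiveRecord::Base.connection.execute('EXECUTE issue_by_id(42)')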

------
mbid
Simple -- just clean out the database from time to time [1].

BTW, I like how the banner at the bottom of [1] currently wants me to "Try
GitLab Enterprise Edition risk-free for 30 days."

[1] [https://about.gitlab.com/2017/02/10/postmortem-of-
database-o...](https://about.gitlab.com/2017/02/10/postmortem-of-database-
outage-of-january-31/)

------
dantiberian
> Sharding would also affect the process of contributing changes to GitLab as
> every contributor would now have to make sure a shard key is present in
> their queries.

Why is this a problem? If you need to make a change to improve scalability, it
doesn't seem unreasonable to ask contributors to follow code and performance
guidelines.

------
bzillins
Looking through the public Grafana dashboards, I noticed that the database
backups page is not showing any data points
([http://monitor.gitlab.net/dashboard/db/backups](http://monitor.gitlab.net/dashboard/db/backups)).
Is this meant to be a public page?

~~~
sytse
This is meant to be a public page. It loads asynchronously; after 3 seconds I
see the following:
[https://www.dropbox.com/s/5mt5f77e60yogk1/Screenshot%202017-...](https://www.dropbox.com/s/5mt5f77e60yogk1/Screenshot%202017-10-30%2015.50.05.png?dl=0)

------
justinclift
Not seeing any mention of using something like Memcached (or anything similar)
in front of PG for caching.

Is that because it's already been done and not in scope?

Asking because for (at least) web applications, using Memcached (or similar)
can significantly reduce the load on the backend database. :)

~~~
sytse
We have a lot of caching in Redis, but it connects to the Rails app rather
than sitting in front of PG. This way it can store the result (for example, a
page fragment) instead of just the DB query that is part of it.
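
A generic Rails-style sketch of the idea (not our actual code):

    # Cache the rendered fragment in Redis; the key includes updated_at so it
    # invalidates itself whenever the issue changes.
    html = Rails.cache.fetch(['issue-fragment', issue.id, issue.updated_at],
                             expires_in: 10.minutes) do
      render_to_string(partial: 'issues/show', locals: { issue: issue })
    end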

~~~
justinclift
Cool, just making sure. :)

------
knownothing
It's interesting to see what parts of the product GitLab is struggling with
compared to GitHub, e.g. [https://githubengineering.com/stretching-
spokes/](https://githubengineering.com/stretching-spokes/)

~~~
sytse
I'm very impressed with the level GitHub's database testing is at:
[https://githubengineering.com/mysql-testing-automation-at-gi...](https://githubengineering.com/mysql-testing-automation-at-github/)
We've still got a lot of work ahead of us, but Yorick's and Greg's work has
been outstanding.

Regarding Spokes, we're not planning something like that at this time. From
[https://gitlab.com/gitlab-org/gitaly/issues/650](https://gitlab.com/gitlab-
org/gitaly/issues/650) "GitHub eventually moved to Spokes
[https://githubengineering.com/building-resilience-in-
spokes/](https://githubengineering.com/building-resilience-in-spokes/) that
had multiple fileservers for the same repository. We will probably not need
that. We run networked storage in the cloud where the public cloud provider is
responsible for the redundancy of the files. To do cross availability zone
failover we'll use GitLab Geo."

~~~
knownothing
Thanks for chiming in, sytse and YorickPeterse!

Re: misuse, I remember this incident
[https://news.ycombinator.com/item?id=11245652](https://news.ycombinator.com/item?id=11245652)
made me chuckle and cry a little. If you build it, it will be misused.

------
d4l3k
Did you consider trying NewSQL DBs that would let you scale horizontally with
much less complexity? Curious to see if the performance is comparable yet.

~~~
YorickPeterse
No, because it probably would have required a complete rewrite of GitLab to
make it work.

~~~
kodablah
The user mentioned "trying". With an app that gets this much use, it might be
reasonable to take a small piece, fork all writes to both the current and an
experimental DB, and then A/B test the reads. If this type of abstraction
requires "a complete rewrite of GitLab" just for "trying NewSQL DBs", then
something is amiss.

~~~
mnutt
Given that GitLab is open source, adding a NewSQL DB into the mix would mean
pushing that dependency onto downstream users, who would then have to take on
the maintenance burden. It may end up being the right call from a performance
standpoint, but simply too complex to ask users to manage.

~~~
kodablah
This burden already exists with dependencies (and we should probably say
downstream "admins" instead of "users", to clarify). A self-hosted app should
not be hamstrung by its initial dependency choices. If tested and implemented
properly, it can go from an optional experimental flag, to a
multiple-storage-backends-supported abstraction, to [maybe] even
we're-migrating-to-a-new-storage-backend, etc. The primary downside in these
cases is the flexibility often granted to plugin and extension writers, which
marries them to early tech choices. Admins who demand a low rate of change are
indirectly the reason many companies choose to offer SaaS only, where that
visibility and flexibility are removed.

Note, I'm not saying this is the case with GitLab at all, and I assume Postgres
will remain the primary choice for most of their DB uses in perpetuity. But
for some use cases there is better tech to use, and the only reason to say no
shouldn't just be the added dependency (though that is often a good reason
among many to say no).

------
nickh9000
Apparently I should stop using gitlab.com. By choosing not to shard your data
you have basically put a cap on scalability... or at least one that will be
reached faster than if you had decided to shard.

~~~
awj
Web service architectures are like shoes. You want what fits now, not what you
hope will fit later.

Chances are very good that preemptive scaling will just result in unnecessary
development overhead and an architecture/set of abstractions that _still_ fail
to handle the expected load, since they weren't battle-tested against it.

------
tkyjonathan
I am a bit disappointed by this article and its hierarchy of how to optimise
Postgres. As a MySQL specialist who has worked on a few Postgres projects, I
can tell you that Postgres has a VERY rich toolkit for speeding things up.
Between optimising queries and load balancing, you have indexing, partitioning
(Postgres even allows you to index EACH partition differently), and triggers
to maintain summary/rollup tables for you.
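
For example, a rollup trigger looks something like this. The tables are
hypothetical, and I've wrapped it in a Rails migration only because GitLab is
a Rails app:

    class AddIssueCountRollup < ActiveRecord::Migration
      def up
        execute <<-SQL
          -- Keep a per-project issue count current on every insert.
          CREATE FUNCTION bump_issue_count() RETURNS trigger AS $$
          BEGIN
            UPDATE project_stats
               SET issue_count = issue_count + 1
             WHERE project_id = NEW.project_id;
            RETURN NEW;
          END;
          $$ LANGUAGE plpgsql;

          CREATE TRIGGER issues_rollup
            AFTER INSERT ON issues
            FOR EACH ROW EXECUTE PROCEDURE bump_issue_count();
        SQL
      end
    end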

It's just that Postgres may need a little more coding and is less automatic
than MySQL, so people shy away from it.

I do like the time and information spent on spreading load across database
servers; not nearly enough people consider splitting reads and writes across a
master/slave setup.

It's just frustrating to me that engineers view databases as some sort of
mysterious black box that you need to work around or replace entirely.

------
deepsun
> 2. Use a connection pooler ...

My jaw dropped at that point -- they didn't use connection pooling? WTF? I've
seen connection pools on pretty much every project I've worked on, even under
1 req/day of load; it's been pretty much a must-have since the '90s. And it's
really just a library (like dbcp2) and a few lines of config, with almost no
drawbacks.

I cannot believe engineers at GitLab's level are so bad at, well, engineering.
I'm losing faith in their product; please show me that I'm wrong.

~~~
YorickPeterse
We've had connection pooling on the client side since forever, just like any
other Rails project. This article specifically talks about server-side
pooling.
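
For context: client-side pooling is just ActiveRecord's built-in connection
pool, sized per process in database.yml, while server-side pooling (pgbouncer)
sits between all of those processes and Postgres. A minimal illustration:

    # config/database.yml (illustrative):
    #   production:
    #     adapter: postgresql
    #     pool: 10   # max connections each Rails process keeps open
    ActiveRecord::Base.connection_pool.with_connection do |conn|
      conn.execute('SELECT 1')  # checks a connection out of the pool and back in
    end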

------
mschuster91
> A side effect of using transaction pooling is that you cannot use prepared
> statements, as the PREPARE and EXECUTE commands may end up running in
> different connections; producing errors as a result.

Uhhh... WHAT? I'll simply sit by and watch for the SQL injection
vulnerabilities resulting from this change. Either you have a _very_ good SQL
query-writing engine with rock-solid escaping, or you _will_ get pwned by this.

~~~
YorickPeterse
You can have perfectly valid input escaping without needing prepared
statements. Many ORMs handle this without the need for PREPARE/EXECUTE.
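
For illustration, with Rails' PostgreSQL adapter you can turn server-side
prepared statements off entirely and still get safe quoting, because the
adapter escapes values client-side:

    # config/database.yml: set `prepared_statements: false` for the adapter.
    User.where('name = ?', params[:name])  # placeholder, quoted by the adapter
    User.where(name: params[:name])        # hash form, same result
    ActiveRecord::Base.connection.quote("O'Brien")  # => "'O''Brien'"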

~~~
mschuster91
> You can have perfectly valid input escaping without needing prepared
> statements.

People have bypassed escaping far too often for my taste, and there is a
reason why injection is number 1 on the OWASP Top 10 (1st place in both RC1
and RC2 of 2017, and also in the 2013 edition).

Never, ever trust any form of escaping, or it _will_ bite you. Hard.

~~~
ris
> Never, ever trust any form of escaping, or it will bite you. Hard.

This is an absurd dogmatic statement. Everybody trusts a multitude of
escapings everywhere every day all over the web without even knowing it.

