Funny thing is that they publicly own up to their performance problems in an unusual way: in their comparison with GitHub (https://about.gitlab.com/comparison/) they list "Fast page load" as a feature that GitLab lacks and GitHub has.
Nevertheless, the slowness is really annoying, especially because their product is so good on all other accounts. If scaling their database can help speed things up, I bet they will be glad to remove this embarrassing "missing feature".
In marketing terms, having fast page load would be called a "qualifier". For example: you expect a hotel to provide toilet paper. You won't pick any hotel because of it, but you will surely avoid one that doesn't.
We just kicked off a major effort to address the availability issue Sid mentions above. The highlights are that we're moving to GCP, which should provide better underlying reliability. But interestingly, we found that only ~20% of our downtime minutes were from underlying infrastructure, whereas ~70% came from features that didn't scale.
So the more exciting part of the project is to tighten the feedback loop between development and deployment with a continuous delivery pipeline. This may be obvious to some people, but it's harder to pull off when you've got an open source project, an on-prem product, and a large-scale SaaS sharing the same code base. I'm calling it "Open-core SaaS" and there are only a handful of companies that run a large, multi-tenant service based on an open source project.
Maybe I just can't find it, but what URLs are pingdom.com checking? Do both issues have the same content? Or does one have 50 comments and another has 1?
Wow, that's a lot higher than I would have expected. The linked Bitrise page says they "randomly selected 10,000 apps as a base, that are getting built regularly on Bitrise", not "enterprises".
How did you come up with the 2/3 number? Is the first line of your about page a lie?
I love that they do, though... It's so honest, and it gives me the feeling that I can trust that this might be the biggest of their problems, and not something else they don't want to talk about.
My team is currently using the hosted Citus Cloud, which uses PgBouncer. They note that the main reason for sharding is high write volume, but we've actually seen just as big a benefit from moving to a sharded setup via Citus for read performance.
By sharding by customer we're able to more effectively leverage Postgres caching and elastically scale. Since switching over, our database has performed and scaled much better than when we were on a single node.
There is some up-front migration work, but that's limited to some minor patching of the ActiveRecord ORM and changes to migration scripts. We considered Cassandra, Elasticsearch, and DynamoDB, but the amount of migration changes needed in the application logic would have taken an unacceptable amount of time.
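For anyone curious what that up-front work buys you, here's a rough sketch of the idea (table and column names are made up; create_distributed_table is the Citus function that picks the shard column):

    # Rough sketch of sharding by customer with Citus (table/column names made up).
    # create_distributed_table() is the Citus function that picks the shard column.
    import psycopg2

    conn = psycopg2.connect("dbname=app")  # the Citus coordinator node
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE events (
            customer_id bigint NOT NULL,
            created_at  timestamptz NOT NULL DEFAULT now(),
            payload     jsonb
        )""")
    cur.execute("SELECT create_distributed_table('events', 'customer_id')")

    # Because customer_id is the shard key, this query is routed to a single
    # shard and only touches that node's cache and disk.
    cur.execute("SELECT count(*) FROM events WHERE customer_id = %s", (42,))
    print(cur.fetchone())
    conn.commit()

The patching we did was mostly about making sure queries like that last one always carry the shard key.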
Craig from Citus here. Thanks for the kind words, Justin. As you mention, there is some upfront work, but once it's in place it becomes fairly manageable.
One of the things that I'm now excited about for Citus is multi-node I/O capacity. It should improve write and read latencies in situations where a single node's disk is getting saturated.
* Edit: I AM excited for this. I'd typo'd "now" as "not" previously.
Let's say that you're using RDS and have a single box capable of ~25k IOPS. If you have 16 boxes each capable of some number of IOPS (let's say 15k), then your total system capacity is significantly higher than 25k. Read workloads where pages need to be pulled from disk (high read IOPS) should generally see an improvement.
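Back-of-the-envelope, with those made-up numbers:

    # Back-of-the-envelope aggregate IOPS for the example above.
    single_node_iops = 25000
    nodes = 16
    per_node_iops = 15000

    cluster_iops = nodes * per_node_iops        # 240,000
    print(cluster_iops / single_node_iops)      # ~9.6x the single-node capacity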
Secondaries cover this case for Gitlab, it seems, but that comes with a set of caveats as well (namely async availability of data).
> Secondaries cover this case for Gitlab, it seems, but that comes with a set of caveats as well (namely async availability of data).
Another caveat is that all secondaries will end up having more or less the same stuff in their cache. As your data set grows bigger that becomes unsustainable, because you're going to read from disk more and more.
When you shard across N nodes you can keep N times as much data in the cache. Combined with N times higher I/O and compute capacity, that can actually give you way more than N times higher read throughput than a single node (for data sets that don't fit in memory), and you can get much higher write throughput as well.
This article is also very useful in showing just how far you can push Postgres _without_ reaching for any of these optimizations. I've seen too many projects worry about these things very early on in their lifecycle, when in reality they are nowhere close to having enough traffic to cause a problem.
Yes, you can typically push PostgreSQL very far while still using a fairly simple setup (e.g. no sharding). Unfortunately, too many times people have the mindset that a slow application is the result of a slow/bad database (as in "it's an RDBMS and RDBMSes don't scale"), rather than the result of it being misused (e.g. badly written queries, lack of proper indexes, that sort of thing). At GitLab it took quite a while to get everybody to see this.
1. Write web application with little to no idea how an RDBMS works
2. Run into inevitable performance problems
3. Decide that your database is at fault, rewrite in NoSQL
4. Buy into NoSQL hype, push 5x resources on your NoSQL solution
5. Profit?
There are NoSQL options that are great for lots of things. It takes some significant expertise to know if your thing is one of those things. Expertise you probably don't have if you don't even have a moderately deep knowledge of RDBMS-es.
And almost certainly you can push PostgreSQL even further by throwing hardware (i.e. RAM) at the problem, although this is probably the point where it stops being worthwhile for reasonably sane applications. (I've seen deployments where a single PostgreSQL server with 2TB+ of RAM was a cheaper solution than this kind of optimisation, due to all the technical debt in the applications.)
It's easy to overlook how much you can do with a decent RDBMS. And for the small amount of data GitLab handles, I believe there are still a lot of big performance wins to be had.
It's just that the new generation doesn't pay much attention to SQL databases...
P.S.: I don't mean the GitLab developers specifically, just in general.
The whole point of worrying about it from the beginning is that you don't have to rework everything later, putting the business at huge risk when you're forced to do it.
You never actually know where your bottlenecks are going to be until they arrive and by designing everything with a "super scalable" architecture you will be making development ten times as painful and expensive as it needs to be while throwing away nice things that come "for free" and "just work" at the mid-low end like transactions.
Amazon and Netflix don't want to have to use their hideously complex and inefficient service architectures - they're forced to because of their scale.
Most people who engineer their systems for hyper-scale from the get-go never see a whiff of anything that looks remotely like high traffic. Often they go out of business before they get anywhere near that.
And once you get to serious scale, you really shouldn't still be running the code from back when you didn't really know what your business/product was.
Certainly. However, there's a difference between keeping scalability in mind, and actually implementing/maintaining a complex database scaling system at the onset of a project.
Edit: I would also add that most projects I've seen tend to undergo one or more large refactors / redesigns before growing to a size where complex DB scaling is needed. This is, of course, speaking from the perspective of a small startup or pet project. If you're writing a new feature for an existing product where you can assume that it'll have lots of users, then you would obviously build scaling right from the start.
Instead you appear to be advocating for over-engineering a solution to a problem you don't yet have, where the opportunity cost is probably features customers could use. That is also a huge risk to a business, but one that is immediate.
For anyone new to GitLab: it is a bear. Even their own cluster/HA documentation is mostly just a thin roadmap to a final goal, a statement of intent if you will. There are just far too many moving parts in this software, and the glaring problem of "Ruby doesn't scale well" is ever-present. Packaged releases depend entirely on a bundled Chef installation to get everything out the door and running.
The software tries to do too much. CI/wiki/code review/git/container repository/analytics package.
CI sharding per team and carving GitLab up into departments and groups has been my only solution so far, and it's just distributed risk at that. It's the difference between the Unix ethos of "do one thing and do it well" and the F-35 approach of "do everything, forever".
I remember installing GitLab on a small NAS-type box. It was a while ago, but on each start-up it ran a nodejs tool to pre-compile some assets, I suppose. On that machine, it took 10-15 minutes to start. Afterwards, the unicorn workers kept getting killed because the default memory limit (128 MB) wasn't enough to process more than literally a couple of requests. It did work, but pages took 2-5 seconds to load, which I couldn't stand.
Anyway, my rant was probably a bit harsh. I know that GitLab isn't really written for hardware comparable to a Raspberry Pi and that it's used by small and large teams all over the world. But on my hardware Gitea renders pages in 20-100 ms, which is orders of magnitude faster than GitLab. I'm sure part of this is the fault of the language.
So just use cgit on your 128mb box and don't complain about gitlab? It's like installing Windows 10 on a machine with 1gb memory, starting Photoshop and complaining about swapping and that it's running slow.
Gitlab runs fine out of the box with 4gb memory and when screwing with some provided knobs it's fine with 2gb or less memory.
It's in the docs, it's in every FAQ, every Stackoverflow answer... so why the hate?
64mb memory? Use gitolite.
256mb memory? Use cgit + gitolite
1gb memory? use gitea or whatever else there is.
4gb memory? use gitlab if you want.
I'm not affiliated but I can't stand the "gitlab doesn't run on my home wifi router - it sucks" posts here... it works pretty well if you read & acknowledge the documentation and requirements.
I'm running a Gitlab instance for a smallish dev team of 10.
I haven't gotten it to run reliably with any less than 8GB, even though we don't have a whole lot of activity there.
It takes surprisingly long to start up too, even on a relatively beefy host.
I do like the product though, the all-in-one solution with code hosting, issues, code review (needs work though...) and CI is great. (Not using the deployment, orchestration stuff).
My biggest issue is that they seem to be too focused on pumping out new features.
There are 9k open issues just for CE. That's probably 12k+ for all products (runner, EE,...).
That's either a sign of very bad project management or of way too big an agenda. Probably both.
It's been very painless in setup and maintenance. Upgrades were mostly painless as well, until now.
With 8GB for the docker container it's solid and pretty stable, but any less and behaviour would be very unstable (timeouts, very long request load times, ...).
Of course those 8GB are for the whole stack (Postgres, Redis, Nginx, Unicorn instances, ...), so I'll admit that 8GB for the Docker container is quite important context here. ;)
My box has 16 GB of memory. The 128 MB value was the default value configured in unicorn. If you're not familiar with that (I wasn't), it has a feature to restart the web server workers when they end up using too much memory. That's what unicorn does and apparently it's what people do in Ruby web apps.
I've also used GitLab on a (real) server. It still feels too slow for my taste (pages load in 1-2s). No, it's not unusable, but I won't call that "fine".
But be aware, Gitea (like Gogs) does NOT cache SQL queries, so you are heavily limited by that. I can't get it to serve more than 200 page loads per second on my system, while even a normal Grails project manages to serve 3000 (with more complicated queries).
CI/CD that is very easy to use. I can self-host it, so no $600-a-month contract like some of the other guys. Decent issues that are not as heavyweight as Jira. One place for code, CI/CD, and issues.
If you are already familiar with wal-e, or even if not, you might want to consider taking a look at wal-g [1]. wal-g is a newer take on wal-e, written in Go, and we've seen up to 7x performance improvements with it [2].
I really wanted to use wal-g, but they had NO support for any backend storage besides S3, and did not appear to have any interest in developing more backends.. (We use GCP)
I’ve set up both pgbouncer and wal-e in the past, and they’re both worth it to have from effectively the very beginning.
But it leaves me wondering: is there any sort of packaging of Postgres and these “friends” into a single opinionated virtual appliance, such that I could just stick up a couple of VMs with the same image with different role tags, and get a good cluster with automatic transparent N:M proxying, automatic transparent backup-and-restore, etc?
In other words, is there a product that is to Postgres as Gitlab CE is to Github (or like Dokku is to Heroku): a “host it yourself, but easily, and starting from a scale of 1 with no overhead” equivalent of a SaaS service?
If you're deploying on K8S at least Crunchy Data has you covered [0], I haven't used it myself but it's the closest I've been able to find without going with something like EnterpriseDB.
I'm pretty sure there's some ansible/puppet/chef/saltstack code out there to build something similar as well.
Or, you know, use something like Cassandra that’s used at massive scales already (by Apple, Netflix, instagram, uber, Walmart, IBM, etc), and scales linearly to make future growth trivial.
Bonus points that rm -rf on a Cassandra server isn’t a problem, which seems like it could have been useful for gitlab.
We're moving away from gitlab because the performance is too much to bear. We're hearing "aaaaaarg I really hate gitlab" a few times a day.
The pipeline page takes 7 seconds to load. It's really bad. We stopped using issues because just listing them was a chore. Pushing one file takes at least 30 seconds to a minute.
Anyways, the work falls on me to replace it. We're going with Phabricator and Jenkins.
Is that self-hosted Gitlab or their hosted version? Would you mind sharing details on your setup? Just curious as we're currently building out our own setup.
- GitLab runners on preemptible nodes with autoscaling
Future (almost done, still in testing phase):
- Phabricator (for git, code review and issues)
- Jenkins ("lightweight" CI, more details at the bottom)
- Google Cloud Container Builder ("Heavyweight" CI)
Our main repository is in Phabricator, but it's mirrored to Google Cloud Repository (for backups and faster heavyweight builds).
On commits, we run the lightweight CI in Jenkins. What I mean by lightweight is that there are only a few runners, and what it does is check what has changed. We have a mono-repository with over 30 microservices; rebuilding all of them is a waste of time and resources, so we have scripts and easy-to-use manifests that define what to do. We also have a lot of small CI checks that are just not worth running externally, like linting.
The skip script takes the last successful build and retags the Docker images to the current one.
The build script sends the instructions to Google Cloud Container Builder to build the docker image, run the tests, and push it. We use Cloud Container Builder because managing those CI servers really sucks.
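Roughly, the lightweight check boils down to something like this (heavily simplified sketch; service paths, image names, and tags are placeholders for what our manifests actually define):

    # Simplified sketch of the "what changed?" step. Service names, paths and
    # image tags are placeholders; the real version reads per-service manifests.
    import subprocess

    SERVICES = ["billing", "auth", "search"]        # subset of the ~30 services
    REGISTRY = "gcr.io/example-project"
    PREV_TAG, CUR_TAG = "build-122", "build-123"

    changed_files = subprocess.check_output(
        ["git", "diff", "--name-only", "HEAD~1", "HEAD"], text=True
    ).splitlines()

    for svc in SERVICES:
        if any(path.startswith("services/%s/" % svc) for path in changed_files):
            # Changed: hand off to the heavyweight CI (Container Builder).
            print("%s: submit heavyweight build for %s" % (svc, CUR_TAG))
        else:
            # Unchanged: reuse the image from the last successful build.
            image = "%s/%s" % (REGISTRY, svc)
            subprocess.check_call(["docker", "pull", "%s:%s" % (image, PREV_TAG)])
            subprocess.check_call(["docker", "tag", "%s:%s" % (image, PREV_TAG), "%s:%s" % (image, CUR_TAG)])
            subprocess.check_call(["docker", "push", "%s:%s" % (image, CUR_TAG)])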
Sounds like a design flaw... adding a centralized database to augment a decentralized version control system.
Comments, issues, pull requests, wiki, etc... should all be first class Git objects. I should be able to create pull requests while offline and push them up when I'm back online.
I would rather baseline git be light and barebones - all the extra stuff is nice to have from a provider, but being part of git itself would make it bloat out.
> A side effect of using transaction pooling is that you cannot use prepared statements, as the PREPARE and EXECUTE commands may end up running in different connections.
There is a way to use prepared statements: it's called pre_prepare [0]. You still need to tweak your code so that it doesn't execute PREPARE directly but instead puts the statements in a special table, but you only do that once, and after that you EXECUTE like before. Another minor (in my view) caveat is that you need to reset the pgbouncer↔postgres connections each time you update that statements table.
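To illustrate what goes wrong without something like this (a contrived sketch, assuming PgBouncer in transaction mode listening on port 6432):

    # Contrived illustration of why SQL-level PREPARE breaks under
    # transaction pooling (assumes PgBouncer on localhost:6432).
    import psycopg2

    conn = psycopg2.connect("host=localhost port=6432 dbname=app")
    cur = conn.cursor()

    cur.execute("PREPARE get_user AS SELECT * FROM users WHERE id = $1")
    conn.commit()  # transaction ends; the server connection goes back to the pool

    # This starts a new transaction, which may be assigned a *different*
    # server connection, one where get_user was never prepared:
    cur.execute("EXECUTE get_user(1)")
    # can fail with: ERROR: prepared statement "get_user" does not exist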
> Sharding would also affect the process of contributing changes to GitLab as every contributor would now have to make sure a shard key is present in their queries.
Why is this a problem? If you need to make a change to improve scalability, it doesn't seem unreasonable to ask contributors to follow code and performance guidelines.
Looking through the public Grafana dashboards I noticed that the database backups page is not showing any data points (http://monitor.gitlab.net/dashboard/db/backups). Is this meant to be a public page?
We have a lot of caching in Redis, but that connects with the Rails app, not in front of PG. This way it can store the result (for example a page fragment) instead of just the db query that is part of it.
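If the distinction isn't clear: it's basically cache-aside at the application layer. A rough sketch of the pattern (Python for brevity; GitLab actually does this with Rails fragment caching, and the key and renderer here are made up):

    # Rough cache-aside sketch: cache the rendered fragment, not the raw query.
    # (GitLab does this via Rails fragment caching; this is just the idea, and
    # render_issue_sidebar() is a made-up placeholder that queries Postgres.)
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def issue_sidebar_html(issue_id):
        key = "fragment:issue_sidebar:%d" % issue_id
        cached = r.get(key)
        if cached is not None:
            return cached.decode()
        html = render_issue_sidebar(issue_id)  # expensive: queries + rendering
        r.setex(key, 300, html)                # keep the fragment for 5 minutes
        return html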
Regarding Spokes we're not planning something like that at this time. From https://gitlab.com/gitlab-org/gitaly/issues/650 "GitHub eventually moved to Spokes https://githubengineering.com/building-resilience-in-spokes/ that had multiple fileservers for the same repository. We will probably not need that. We run networked storage in the cloud where the public cloud provider is responsible for the redundancy of the files. To do cross availability zone failover we'll use GitLab Geo."
Most of our problems thus far have been the result of misuse of our resources one way or another. GitHub on the other hand probably dealt with these issues earlier on (or are able to hide them somehow, perhaps by throwing a lot of physical hardware at the problem), thus they can now focus on dealing with problems caused by reaching the limits of their (old) systems.
Did you consider trying NewSQL DBs that would let you scale horizontally with much less complexity? Curious to see if the performance is comparable yet.
Postgres is generally a much better choice for relational data. For their hosted platform they should eventually consider Cassandra for some of the most commonly used tables. (the table storing revisions for instance)
User mentioned "trying". With an app that gets this much use, it might be reasonable to take a small piece and fork all writes to both the current + an experimental DB and then A/B test the reads. If this type of abstraction requires "a complete rewrite of GitLab" for "trying NewSQL DBs" then something is amiss.
Given that Gitlab is open source, adding a NewSQL DB into the mix would be pushing that dependency to downstream users, who would then have to take on the maintenance burden. It may end up being the right call from a performance standpoint but just too complex to ask users to manage.
This burden already exists with dependencies (and we should probably say downstream "admins" instead of "users" to clarify). A self-hosted app should not be hamstrung by its initial dependency choices. If tested and implemented properly, a new backend can go from optional experimental flag, to a multiple-storage-backends-supported abstraction, to [maybe] even we're-migrating-to-the-new-storage-backend. The primary downside in these cases is the flexibility often given to plugin and extension writers, which marries them to early tech choices. Admins who demand that little change as the product moves forward are, indirectly, a reason many companies choose to offer SaaS only, where that visibility and flexibility are removed.
Note, I'm not saying this is the case with GitLab at all and I assume Postgres will remain the primary choice for most of their DB uses in perpetuity, but for some use cases there is better tech to use and the only reason to say no shouldn't just be a dependency addition (though it is often a good reason among many to say no).
It would require a complete rewrite for at least the data layer, as everything right now assumes an RDBMS. Further it would complicate installation as now everybody would have to use a fancy "NewSQL" database, some of which you may not even be able to run yourself.
I think NewSQL refers to database systems that still enforce ACID transactions, but use various strategies to avoid the scalability difficulties that have traditionally been associated with RDBMS as both consistent and available systems.
Apparently I should stop using gitlab.com. By choosing not to shard your data you have basically put a cap on scalability... or at least one that will be reached faster than if you had decided to shard.
Web service architectures are like shoes. You want what fits now, not what you hope will fit later.
Chances are very good that preemptive scaling will just result in unnecessary development overhead and an architecture/set of abstractions that still fail to handle the expected load since they weren't battle tested against it.
I am a bit disappointed by this article and its hierarchy of how to optimise Postgres. As a MySQL specialist who has worked on a few Postgres projects, I can tell you that Postgres has a VERY RICH toolkit for speeding things up. Between optimising queries and load balancing, you have indexing, partitioning (Postgres even allows you to index EACH partition differently), and triggers to maintain summary/rollup tables for you.
It's just that Postgres may need a little more coding and is less automatic than MySQL, so people shy away from it.
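For example, a quick sketch of declarative partitioning (Postgres 10+) with a different index per partition; table and column names are made up:

    # Sketch: declarative partitioning (Postgres 10+) with a different index
    # on each partition. Table and column names are made up.
    import psycopg2

    ddl = [
        """CREATE TABLE page_views (
               project_id bigint NOT NULL,
               viewed_at  timestamptz NOT NULL,
               path       text
           ) PARTITION BY RANGE (viewed_at)""",
        """CREATE TABLE page_views_2017 PARTITION OF page_views
               FOR VALUES FROM ('2017-01-01') TO ('2018-01-01')""",
        """CREATE TABLE page_views_2018 PARTITION OF page_views
               FOR VALUES FROM ('2018-01-01') TO ('2019-01-01')""",
        # Hot partition: composite index for per-project lookups.
        "CREATE INDEX ON page_views_2018 (project_id, viewed_at)",
        # Cold partition: a cheaper index, queried mostly by time range.
        "CREATE INDEX ON page_views_2017 (viewed_at)",
    ]

    with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
        for stmt in ddl:
            cur.execute(stmt)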
I do like the time and information spent on spreading load across database servers, and not nearly enough people consider splitting reads and writes across a master/slave setup.
It's just frustrating to me that engineers view databases as some sort of mysterious black box that you need to work around or replace entirely.
My jaw dropped at that point -- they didn't use connection pooling? WTF? I've seen connection pools on pretty much every project I've worked on, even under 1 req/day load; it's been pretty much a "must have" since the 90s. And it's really just a library (like dbcp2) and a few lines of config, with almost no drawbacks.
I cannot believe engineers at GitLab's level are so bad at, well, engineering. I'm losing faith in their product; please show me that I'm wrong.
We've had connection pooling on the client side since forever, just like any other Rails project. This article specifically talks about server side pooling.
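If the distinction is unclear: client-side pooling is the per-process pool inside the application (ActiveRecord's pool in our case), while server-side pooling (PgBouncer) sits in front of Postgres and is shared by all app processes. A generic sketch of the client-side idea, not our actual code:

    # Generic client-side pooling sketch (not GitLab's setup, which is
    # ActiveRecord's built-in pool). Each app process reuses a handful of
    # connections instead of opening a new one per request.
    from psycopg2.pool import ThreadedConnectionPool

    pool = ThreadedConnectionPool(minconn=2, maxconn=10, dsn="dbname=app")

    def handle_request(issue_id):
        conn = pool.getconn()
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT title FROM issues WHERE id = %s", (issue_id,))
                return cur.fetchone()
        finally:
            pool.putconn(conn)  # return the connection, don't close it

Server-side pooling does the analogous thing one level down, multiplexing all of those per-process pools onto a much smaller number of real Postgres connections.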
Why are they bad at engineering just for not using a feature that you happened to be aware of? If anything, GitLab has proven to be reasonably good at engineering, given they've made it this far with a single Postgres instance with no tweaking. It also speaks volumes for PG itself.
> A side effect of using transaction pooling is that you cannot use prepared statements, as the PREPARE and EXECUTE commands may end up running in different connections; producing errors as a result.
Uhhh... WHAT? I'll simply sit by and watch for SQL injection vulnerabilities resulting from this change. Either you have a very good SQL query writer engine with rock solid escaping or you will get pwned by this.
> You can have perfectly valid input escaping without needing prepared statements.
People have bypassed escaping far too often for my taste, and there is a reason injection (SQL and otherwise) is number 1 in the OWASP Top 10 (1st place in both RC1 and RC2 of the 2017 edition, and also in the 2013 edition).
Never, ever trust any form of escaping or it will bite you. Hard.
What is transaction pooling anyway? A transaction exists for a specific action, it shouldn't be shared among users or multiple connections. It sounds like they've taken transactions and removed the isolation.
I don't know what their data access is like, but I get the impression that it's like some "enterprise" patterns I've seen where data access is too "encapsulated", i.e. each DAL opens its own connection and pulls its chunk of data, and each request involves many of these, instead of each request opening one connection and passing it down to the data access layers.
"transaction-level" pooling might be a more apt description. Instead of assigning each incoming connection to a dedication upstream connection for the entire duration of the incoming connection, it assigns the upstream connections on a per-transaction basis. When each transaction ends, the upstream connection is returned to the pool. A better description is at https://wiki.postgresql.org/wiki/PgBouncer.
In Postgres, beyond the SQL-level PREPARE command, there is also a wire protocol for executing prepared statements. Their client drivers almost certainly use this protocol, which would mean the normal placeholder safety still applies. I think this is just a misunderstanding of their point, and not a full backtracking to string escaping.
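For example, with a driver that binds parameters over the extended query protocol (psycopg 3 here purely as an illustration; GitLab itself uses Ruby's pg gem), the values are sent separately from the SQL text, so losing SQL-level PREPARE doesn't mean falling back to string concatenation:

    # Illustration with psycopg 3 (GitLab actually uses Ruby's pg gem, but the
    # idea is the same): the parameter is bound via the protocol, never spliced
    # into the SQL string by the application. prepare=False avoids a server-side
    # *named* prepared statement, which doesn't mix well with transaction pooling.
    import psycopg

    with psycopg.connect("host=localhost port=6432 dbname=app") as conn:
        row = conn.execute(
            "SELECT title FROM issues WHERE id = %s",
            (1,),
            prepare=False,
        ).fetchone()
        print(row)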