Scaling the GitLab database (about.gitlab.com)
228 points by fanf2 on Oct 30, 2017 | 107 comments



Funny thing is that they publicly own up to their performance problems in an unusual way: in their comparison with GitHub (https://about.gitlab.com/comparison/) they list "Fast page load" as a feature that GitLab lacks and GitHub has.

Nevertheless, the slowness is really annoying, especially because their product is so good on all other counts. If scaling their database can help speed things up, I bet they will be glad to remove this embarrassing "missing feature".

In marketing terms, having fast page load would be called a "qualifier". For example: you expect a hotel to provide toilet paper. You won't pick any hotel because of it, but you will surely avoid one that doesn't.


The good news is that page loads are much better now. Our initial page load time on Pingdom http://stats.pingdom.com/81vpf8jyr1h9/1902794 is better than GitHub.com's http://stats.pingdom.com/81vpf8jyr1h9/1902795

We still have work to do on the 99th percentile and on merge request page load, but the overall situation has improved dramatically.

We still have work to do on availability, so I changed the 'feature' to reflect this https://gitlab.com/gitlab-com/www-gitlab-com/commit/3b3bddf5...


Another GitLabber here.

We just kicked off a major effort to address the availability issue Sid mentions above. The highlights are that we're moving to GCP, which should provide better underlying reliability. But interestingly, we found that only about ~20% of our downtime minutes were from underlying infrastructure, whereas ~70% came from features that didn't scale.

So the more exciting part of the project is to tighten the feedback loop between development and deployment with a continuous delivery pipeline. This may be obvious to some people, but it's harder to pull off when you've got an open source project, an on-prem product, and a large-scale SaaS sharing the same code base. I'm calling it "Open-core SaaS" and there are only a handful of companies that run a large, multi-tenant service based on an open source project.


Maybe I just can't find it, but which URLs is pingdom.com checking? Do both issues have the same content? Or does one have 50 comments and the other has 1?


These are checking https://github.com/gitlabhq/gitlabhq/issues/1 and https://gitlab.com/gitlab-org/gitlab-ce/issues/1

The one on GitLab is considerably longer.


That's probably the best "feature comparison" page I've ever seen. (At least, the best written by a creator of one of the products being compared.)


Thanks! We try to keep it fair. I love that our marketing team is on board with listing our missing features very publicly https://about.gitlab.com/features/#missing


> 2/3 of Enterprises Use GitLab

Wow, that's a lot higher than I would have expected. The linked Bitrise page used "randomly selected 10,000 apps as a base, that are getting built regularly on Bitrise", not "enterprises".

How did you come up with the 2/3 number? Is the first line of your about page a lie?


I love that they do though... It's so honest, and it gives me the feeling that I can trust that that might be the biggest of their problems, and not something else they don't want to talk about.


>In marketing terms, having fast page load would be called a "qualifier".

"Table stakes" is a common term for this.


My team is currently using the hosted Citus Cloud, which uses PgBouncer. They note that the reason for sharding is high write volume. We've actually seen big benefits from moving over to a sharded setup via Citus just as much for read performance.

By sharding by customer we're able to more effectively leverage Postgres caching and elastically scale. Since switching over, our database has performed and scaled much better than when we were on a single node.
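
For anyone curious what that looks like in practice: with Citus the sharding itself is declared once per table, and queries that filter on the tenant column get routed to a single shard. A rough sketch (the table and column names are made up, not our actual schema):

    -- Tell Citus to distribute the table by customer:
    SELECT create_distributed_table('events', 'customer_id');

    -- Queries that include the shard key hit a single node, so each
    -- node's cache only has to hold its own tenants' hot data:
    SELECT count(*)
      FROM events
     WHERE customer_id = 1234
       AND created_at > now() - interval '7 days';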

There is some up-front migration work, but that's limited to some minor patching of the ActiveRecord ORM and changes to migration scripts. We considered Cassandra, Elasticsearch, and DynamoDB, but the amount of migration changes in the application logic would take an unacceptable amount of time.


Craig from Citus here. Thanks for the kind words, Justin. As you mention, there is some upfront work, but once it's in place it becomes fairly manageable.

We've been working to make that upfront work easier as well with libraries that allow things to be more drop-in (ActiveRecord-multi-tenant: https://github.com/citusdata/activerecord-multi-tenant and Django-multitenant: https://github.com/citusdata/django-multitenant)


Praying for a http://docs.sequelizejs.com/ plugin


For a customer DB, why was Citus required for your use-case? Do you have millions of users?


One of the things that I'm now excited about for Citus is multi-node IO capacity. It should improve write and read latencies in situations where a single node's disk is getting saturated.

* Edit. I AM excited for this. Typo'd now = not previously


Not sure I fully follow on the multi node IO capacity, can you share a bit more on the workload that you're concerned about?

Edit: Thanks for the clarification; the typo part in particular threw me off. Makes sense now.


Let's say that you're using RDS and have a single box capable of ~25k IOPS. If you have 16 boxes each capable of some number of IOPS (let's say 15k), then your total system capacity is significantly higher than 25k (16 × 15k = 240k in aggregate). On read workloads where pages need to be pulled from disk (high read IOPS), this should see improvement in general.

Secondaries cover this case for Gitlab, it seems, but that comes with a set of caveats as well (namely async availability of data).


> Secondaries cover this case for Gitlab, it seems, but that comes with a set of caveats as well (namely async availability of data).

Another caveat is that all secondaries will end up having more or less the same stuff in their cache. As your data set grows bigger that becomes unsustainable, because you're going to read from disk more and more.

When you shard across N nodes you can keep N times as much data in the cache. Combined with N times higher I/O and compute capacity, that can actually give you way more than N times higher read throughput than a single node (for data sets that don't fit in memory), and you can get much higher write throughput as well.


Great points, thanks!


This article is also very useful in showing just how far you can push Postgres _without_ reaching for any of these optimizations. I've seen too many projects worry about these things very early on in their lifecycle, when in reality they are nowhere close to having enough traffic to cause a problem.


Yes, you can typically push PostgreSQL very far while still using a fairly simple setup (e.g. no sharding). Unfortunately too many times people have this mindset that a slow application is the result of a slow/bad database (as in "it's an RDBMS and RDBMS' don't scale"), and not the result of it being misused (e.g. badly written queries, lack of proper indexes, that sort of thing). At GitLab it took quite a while to get everybody to see this.
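
To make the "misuse" point concrete, the classic case is a hot query running as a sequential scan because nobody added the right index. A minimal illustration (table and column names are invented, not GitLab's schema):

    -- EXPLAIN ANALYZE reveals a sequential scan over a big table:
    EXPLAIN ANALYZE
    SELECT * FROM issues WHERE project_id = 42 AND state = 'opened';

    -- A composite index turns it into an index scan, often cutting
    -- the query from seconds to milliseconds:
    CREATE INDEX index_issues_on_project_id_and_state
        ON issues (project_id, state);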


1. Write web application with little to no idea how an RDBMS works

2. Run into inevitable performance problems

3. Decide that your database is at fault, rewrite in NoSQL

4. Buy into NoSQL hype, push 5x resources on your NoSQL solution

5. Profit?

There are NoSQL options that are great for lots of things. It takes some significant expertise to know if your thing is one of those things. Expertise you probably don't have if you don't even have a moderately deep knowledge of RDBMS-es.


    > Buy into NoSQL hype, push 5x resources on your NoSQL solution
NoSQL is so last year, NewSQL is the future!


Lack of indexes vs. way too many indexes, to the point that writes suffer.
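
The second case is at least easy to spot in Postgres: every extra index has to be maintained on every write, and the statistics views will tell you which indexes never pay that cost back. A small sketch using pg_stat_user_indexes:

    -- Indexes that have never been scanned since the stats were last reset;
    -- each one still adds overhead to every INSERT/UPDATE on its table.
    SELECT schemaname, relname, indexrelname, idx_scan
      FROM pg_stat_user_indexes
     WHERE idx_scan = 0
     ORDER BY relname, indexrelname;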


And almost certainly you can push PostgreSQL even further by throwing hardware (i.e. RAM) at the problem, although this is probably the point where it stops being worthwhile for reasonably sane applications. (I've seen deployments where a single pgsql server with 2TB+ RAM was a cheaper solution than this kind of optimisation, due to all the technical debt in the applications.)


Exactly. When they confessed they didn't use connection pooling I almost fell off my chair.


Great point.

It's easy to overlook how much you can do with a decent RDBMS. And for the small data that GitLab uses, I believe there are still a lot of big performance wins to be had.

It's just that the new generation doesn't pay much attention to SQL databases...

P.S.: I don't mean the GitLab developers, just in general.


The whole point of worrying about this from the beginning is that you don't have to rework everything and put the business at huge risk when you're forced to do it later.


No. No. Never do this.

You never actually know where your bottlenecks are going to be until they arrive and by designing everything with a "super scalable" architecture you will be making development ten times as painful and expensive as it needs to be while throwing away nice things that come "for free" and "just work" at the mid-low end like transactions.

Amazon and Netflix don't want to have to use their hideously complex and inefficient service architectures - they're forced to because of their scale.

Most people who engineer their systems for hyper-scale from the get-go never see a whiff of anything that looks remotely like high traffic. Often they go out of business before they get anywhere near that.

And once you get to serious scale, you really shouldn't still be running the code from back when you didn't really know what your business/product was.


Certainly. However, there's a difference between keeping scalability in mind, and actually implementing/maintaining a complex database scaling system at the onset of a project.

Edit: I would also add that most projects I've seen tend to undergo one or more large refactors / redesigns before growing to a size where complex DB scaling is needed. This is, of course, speaking from the perspective of a small startup or pet project. If you're writing a new feature for an existing product where you can assume that it'll have lots of users, then you would obviously build scaling right from the start.


Instead you appear to be advocating over-engineering a solution for a problem you don't yet have, where the opportunity cost is probably features customers could use. That is also a huge risk to a business, but one that is immediate.


Thinking about how you would shard your data is hardly over-engineering. On some storage systems you are forced to anyway: Cassandra, DynamoDB.

And the cost of building the solution from the outset is hardly a killer. The pain usually comes from operational complexity.


For anyone new to GitLab, it is a bear. Even their own cluster/HA documentation is mostly just a thin roadmap to a final goal, a statement of intent if you will. There are just far too many moving parts in this software, and the glaring problem of "Ruby doesn't scale well" is ever-present. Packaged releases depend entirely on a bundled Chef installation to get everything out the door and running.

The software tries to do too much. CI/wiki/code review/git/container repository/analytics package.

CI sharding per team and carving up GitLab into departments and groups has been my only solution so far, and it's just distributed risk at that. It's the difference between the Unix ethos of doing one thing and doing it well, versus the F-35 approach of doing everything, forever.


    > and the glaring problem of "Ruby doesn't scale well" is ever-present
GitLab has many performance problems (and many have been solved over the years), but Ruby has thus far not been one of them.


I remember installing GitLab on a small NAS-type box. It was a while ago, but on each start-up it ran a nodejs tool to pre-compile some assets, I suppose. On that machine, it took 10-15 minutes to start. Afterwards, the unicorn workers kept getting killed because the default memory limit (128 MB) wasn't enough to process more than literally a couple of requests. It did work, but pages took 2-5 seconds to load, which I couldn't stand.

For anyone needing a small Git Web UI, I suggest https://github.com/go-gitea/gitea.

EDIT: Oh, you're that guy ^^;.

Anyway, my rant was probably a bit harsh. I know that GitLab isn't really written for hardware comparable to a Raspberry Pi and that it's used by small and large teams all over the world. But on my hardware Gitea renders pages in 20-100 ms, which is orders of magnitude faster than GitLab. I'm sure part of this is the fault of the language.


So just use cgit on your 128mb box and don't complain about gitlab? It's like installing Windows 10 on a machine with 1gb memory, starting Photoshop and complaining about swapping and that it's running slow.

Gitlab runs fine out of the box with 4gb memory and when screwing with some provided knobs it's fine with 2gb or less memory.

It's in the docs, it's in every FAQ, every Stackoverflow answer... so why the hate?

64mb memory? Use gitolite. 256mb memory? Use cgit + gitolite. 1gb memory? Use gitea or whatever else there is. 4gb memory? Use gitlab if you want.

I'm not affiliated but I can't stand the "gitlab doesn't run on my home wifi router - it sucks" posts here... it works pretty well if you read & acknowledge the documentation and requirements.


I'm running a Gitlab instance for a smallish dev team of 10.

I haven't gotten it to run reliably with any less than 8Gb, even though we don't have a whole lot of activity there.

It takes surprisingly long to start up too, even on a relatively beefy host.

I do like the product though, the all-in-one solution with code hosting, issues, code review (needs work though...) and CI is great. (Not using the deployment, orchestration stuff).

My biggest issue is that they seem to be too focused on pumping out new features.

There are 9k open issues just for CE. That's probably 12k+ for all products (runner, EE,...).

That is either a sign of very bad project management or of a way too big agenda. Probably both.


I'm surprised to hear GitLab isn't running reliably with 8GB. Are you running the Omnibus installations?

The open issues for CE https://gitlab.com/gitlab-org/gitlab-ce/issues do include feature requests.


I'm using the official docker image.

It's been very painless in setup and maintenance. Upgrades were mostly painless as well until now.

With 8GB for the docker container it's solid and pretty stable, but any less and behaviour would be very unstable (timeouts, very long request load times, ...).

Of course those 8GB are for the whole stack (Postgres, Redis, Nginx, Unicorn instances, ...), so I'll admit that 8GB for the Docker container is quite important context here. ;)


Thanks for the answer. I would have thought that 4GB is enough. Thanks for providing a data point that it might not be.


https://about.gitlab.com/2015/04/21/gitlab-on-raspberry-pi-2...

My box has 16 GB of memory. The 128 MB value was the default value configured in unicorn. If you're not familiar with that (I wasn't), it has a feature to restart the web server workers when they end up using too much memory. That's what unicorn does and apparently it's what people do in Ruby web apps.

I've also used GitLab on a (real) server. It still feels too slow for my taste (pages load in 1-2s). No, it's not unusable, but I won't call that "fine".


> For anyone needing a small Git Web UI, I suggest https://github.com/go-gitea/gitea.

But be aware, gitea (like gogs) does NOT cache SQL queries, so you are heavily limited by that. I can't get it to serve > 200 pageloads per second on my system, while even a normal Grails project manages to serve 3000 (with more complicated queries).


Thanks for the heads-up. In my case I'm the single user, so it doesn't matter, but I'm sure it could handle a lot more.

It used to cache something, either SQL queries or rendered pages, but it might have stopped in the meanwhile.

You made me curious, so I just tested it: 102 rps on a project page, 365 rps on a user profile, as reported by wrk2.


That sounds like the webpack development server was somehow started, which definitely is not required (or even recommended) for production setups.


> I'm sure part of this is the fault of the language.

"Fault" is a pretty loaded word here. Pretty sure my old TI-86 also would struggle to run GitLab...


IIRC this step is not needed/required anymore.


On the flipside, I quite enjoy Gitlab.

CI/CD that is very easy to use. I can self-host it, so no $600-a-month contract like some of the other guys. Decent issues that are not as heavyweight as Jira. And code, CI/CD, and issues are all in one place.


I find the GitLab approach a non-starter for me too.

Both the "everything all in one" approach, and their insistence on "omnibus" packages that bundle fucking everything into a giant deb/rpm.

It's unfortunate to me that there isn't a simple (as in cgit/gitweb/hgweb) OSS repo web viewer that handles multiple repo types.

I want issues, ci, project management separate.


Running PgBouncer is a very basic optimization. You typically start out with PgBouncer in your stack if you have experience running Postgres.

If you're new to running your own Postgres databases you should also check out WAL-E: https://github.com/wal-e/wal-e and the awesome pg_stat_statements https://www.postgresql.org/docs/10/static/pgstatstatements.h...
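
If you haven't used pg_stat_statements before, getting a "what is actually slow" list is roughly this (note: the column was called total_time in the 9.x/10 releases this thread is about; newer versions renamed it total_exec_time):

    -- postgresql.conf needs: shared_preload_libraries = 'pg_stat_statements'
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- The ten queries eating the most cumulative execution time:
    SELECT calls, total_time, mean_time, left(query, 80) AS query
      FROM pg_stat_statements
     ORDER BY total_time DESC
     LIMIT 10;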


If you are already familiar with wal-e, or even if not, you might want to consider taking a look at wal-g[1]. wal-g is a newer edition of wal-e, written in Go, that we've seen deliver up to 7x performance improvements[2].

[1] https://github.com/wal-g/wal-g

[2] https://www.citusdata.com/blog/2017/08/18/introducing-wal-g-...


I really wanted to use wal-g, but it had NO support for any backend storage besides S3, and the project did not appear to have any interest in developing more backends. (We use GCP.)


While wal-g looks great, having evaluated it a bit, I worry that nobody seems to be maintaining it based on issues/pr's I looked at.


Scratch that - I see fresh commits - will give it another spin!


I’ve set up both pgbouncer and wal-e in the past, and they’re both worth it to have from effectively the very beginning.

But it leaves me wondering: is there any sort of packaging of Postgres and these “friends” into a single opinionated virtual appliance, such that I could just stick up a couple of VMs with the same image with different role tags, and get a good cluster with automatic transparent N:M proxying, automatic transparent backup-and-restore, etc?

In other words, is there a product that is to Postgres as Gitlab CE is to Github (or like Dokku is to Heroku): a “host it yourself, but easily, and starting from a scale of 1 with no overhead” equivalent of a SaaS service?


If you're deploying on K8s, at least Crunchy Data has you covered [0]. I haven't used it myself, but it's the closest I've been able to find without going with something like EnterpriseDB.

I'm pretty sure there's some ansible/puppet/chef/saltstack code out there to build something similar as well.

[0] https://github.com/CrunchyData/crunchy-containers


There's a postgres operator by Zalando which sets up postgres with WAL-E and auto failover using Patroni.

https://github.com/zalando-incubator/postgres-operator


Other options available:

- CockroachDB 1.1 [1]

- AWS RDS Aurora with PostgreSQL compatibility [2]

[1] https://news.ycombinator.com/item?id=15458900

[2] https://news.ycombinator.com/item?id=13072861


Aurora RDS-PG still isn't GA, as in isn't available in all datacenters.

Edit: Apparently it is GA as of 6 days ago, but still only available in 4 regions https://aws.amazon.com/blogs/aws/now-available-amazon-aurora...


There are only 9 regions with at least 3 AZs (as required by Aurora design), it's available in 4 of them.


CockroachDB is a really cool product, but I think by all accounts it is still nowhere near performant enough to be used in production.


Or, you know, use something like Cassandra that’s used at massive scales already (by Apple, Netflix, instagram, uber, Walmart, IBM, etc), and scales linearly to make future growth trivial.

Bonus points that rm -rf on a Cassandra server isn’t a problem, which seems like it could have been useful for gitlab.


Alternative is Google Spanner. Also still new, but already in GA for 5 months now.


A Spanner setup that can match PG on a high-end x86 cluster will run way over $100K/month.


Well, at least Spanner has already been in development for 10 years.


We're moving away from gitlab because the performance is too much to bear. We're hearing "aaaaaarg I really hate gitlab" a few times a day.

The pipeline pages take 7 seconds to load. It's really bad. We stopped using issues because just listing them was a chore. Pushing 1 file takes at least 30 seconds to 1 minute.

Anyways, the work falls on me to replace it. We're going with Phabricator and Jenkins.


Is that self-hosted Gitlab or their hosted version? Would you mind sharing details on your setup? Just curious as we're currently building out our own setup.


Right now:

- Self hosted Gitlab CE

- Gitlab runners in pre-emptibles nodes with autoscaling

Future (almost done, still in testing phase):

- Phabricator (for git, code review and issues)

- Jenkins ("lightweight" CI, more details at the bottom)

- Google Cloud Container Builder ("Heavyweight" CI)

Our main repository is in Phabricator, but it has mirroring to Google Cloud Repository (for backup and faster heavyweight builds)

On commits, we run the lightweight CI in Jenkins. What I mean by lightweight is that there are only a few runners, and what they do is check what has changed. We have a mono-repository with over 30 microservices; rebuilding all of them is a waste of time and resources, so we have scripts and easy-to-use manifests that define what to do. We also have a lot of small CI checks that are just not worth running externally, like linting.

Here's what our homegrown manifests look like:

  build_graphql:
    when_changed:
      - graphql
      - node_libraries/player-graphdb
      - node_libraries/player-auth
      - node_libraries/player-models
    script_build: graphql/cloudbuild.yaml
    script_skip: graphql/cloudbuild-skip.yaml

The skip script takes the last successful build and retags the Docker images to the current one. The build script sends the instructions to Google Cloud Container Builder to build the Docker image, run the tests, and push it. We use Cloud Container Builder because managing those CI servers really sucks.

Let me know what else you want to know.


Sounds like a design flaw... adding a centralized database to augment a decentralized version control system.

Comments, issues, pull requests, wiki, etc... should all be first class Git objects. I should be able to create pull requests while offline and push them up when I'm back online.


Issues can be in the DVCS with something like Bugs Everywhere.

Git has native "pull requests"; they're just not web-based, they're email-based.

A good wiki is markdown docs in a git/hg repo and something like Gollum to serve web viewers.

So - in a way those things can all be handled as regular content in git/hg.


I would rather baseline git be light and barebones - all the extra stuff is nice to have from a provider, but being part of git itself would make it bloat out.


> A side effect of using transaction pooling is that you cannot use prepared statements, as the PREPARE and EXECUTE commands may end up running in different connections.

There is a way to use prepared statements: it's called pre_prepare [0]. You still need to tweak your code to not execute PREPARE directly but rather put the statements in a special table, but you do that only once, and after that you EXECUTE like before. Another minor (in my view) caveat is that you need to reset the pgbouncer↔postgres connections each time you update that statements table.

[0] https://github.com/dimitri/preprepare
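
If it helps anyone reading along, the flow is roughly the following. I'm going from memory of the README here, so treat the table layout and function name as placeholders and check the repo for the real ones:

    -- Statements live in a table the extension reads, instead of the app
    -- issuing PREPARE itself (placeholder table/column names):
    INSERT INTO prepared_statements (name, statement)
    VALUES ('fetch_issue', 'SELECT * FROM issues WHERE id = $1');

    -- pgbouncer then re-prepares everything on each fresh server connection
    -- (e.g. connect_query = 'SELECT prepare_all();' in pgbouncer.ini),
    -- so the application keeps calling EXECUTE exactly as before:
    EXECUTE fetch_issue(42);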


Simple -- just clean the database from time to time [1].

BTW, I like how the banner on the bottom of [1] wants me to "Try GitLab Enterprise Edition risk-free for 30 days." currently.

[1] https://about.gitlab.com/2017/02/10/postmortem-of-database-o...


> Sharding would also affect the process of contributing changes to GitLab as every contributor would now have to make sure a shard key is present in their queries.

Why is this a problem? If you need to make a change to improve scalability, it doesn't seem unreasonable to ask contributors to follow code and performance guidelines.


Looking through the public Grafana dashboards I noticed that the database backups page is not showing any data points (http://monitor.gitlab.net/dashboard/db/backups). Is this meant to be a public page?


This is meant to be a public page. It loads async. After 3 seconds I see the following https://www.dropbox.com/s/5mt5f77e60yogk1/Screenshot%202017-...


There was an issue with the dashboard causing it to try and load data from the wrong place; this has been fixed. See https://gitlab.com/gitlab-com/infrastructure/issues/3115 for more info.


Not seeing any mention of using something like Memcached (or anything similar) in front of PG for caching.

Is that because it's already been done and not in scope?

Asking because for (at least) web applications, using Memcached (or similar) can significantly reduce the load on the backend database. :)


We have a lot of caching in Redis, but that connects with the Rails app, not in front of PG. This way it can store the result (for example a page fragment) instead of just the db query that is part of it.


Cool, just making sure. :)


It's interesting to see what parts of the product GitLab is struggling with compared to GitHub, e.g. https://githubengineering.com/stretching-spokes/


I'm very impressed with the level GitHub's database testing is at https://githubengineering.com/mysql-testing-automation-at-gi... We still have a lot of work ahead of us, but Yorick's and Greg's work has been outstanding.

Regarding Spokes we're not planning something like that at this time. From https://gitlab.com/gitlab-org/gitaly/issues/650 "GitHub eventually moved to Spokes https://githubengineering.com/building-resilience-in-spokes/ that had multiple fileservers for the same repository. We will probably not need that. We run networked storage in the cloud where the public cloud provider is responsible for the redundancy of the files. To do cross availability zone failover we'll use GitLab Geo."


Thanks for chiming in, sytse and YorickPeterse!

Re: misuse, I remember this incident https://news.ycombinator.com/item?id=11245652 made me chuckle and cry a little. If you build it, it will be misused.


Most of our problems thus far have been the result of misuse of our resources one way or another. GitHub on the other hand probably dealt with these issues earlier on (or are able to hide them somehow, perhaps by throwing a lot of physical hardware at the problem), thus they can now focus on dealing with problems caused by reaching the limits of their (old) systems.


Did you consider trying NewSQL DBs that would let you scale horizontally with much less complexity? Curious to see if the performance is comparable yet.


Postgres is generally a much better choice for relational data. For their hosted platform they should eventually consider Cassandra for some of the most commonly used tables. (the table storing revisions for instance)


No, because it probably would have required a complete rewrite of GitLab to make it work.


User mentioned "trying". With an app that gets this much use, it might be reasonable to take a small piece and fork all writes to both the current + an experimental DB and then A/B test the reads. If this type of abstraction requires "a complete rewrite of GitLab" for "trying NewSQL DBs" then something is amiss.


Given that Gitlab is open source, adding a NewSQL DB into the mix would be pushing that dependency to downstream users, who would then have to take on the maintenance burden. It may end up being the right call from a performance standpoint but just too complex to ask users to manage.


This burden already exists with dependencies (and we should probably say downstream "admins" instead of "users" to clarify). A self-hosted app should not be hamstrung by its initial dependency choices. If tested and implemented properly it can go from optional experimental flag, to multiple-storage-backends-supported abstraction, to [maybe] even we're-migrating-to-new-storage-backend, etc. The primary downside in these cases is the flexibility often provided to plugin and extension writers, which marries them to early tech choices. Admins who demand a low rate of change as the product moves forward are, indirectly, the reason many companies choose to offer SaaS only, where that visibility and flexibility are removed.

Note, I'm not saying this is the case with GitLab at all and I assume Postgres will remain the primary choice for most of their DB uses in perpetuity, but for some use cases there is better tech to use and the only reason to say no shouldn't just be a dependency addition (though it is often a good reason among many to say no).


It would require a complete rewrite for at least the data layer, as everything right now assumes an RDBMS. Further it would complicate installation as now everybody would have to use a fancy "NewSQL" database, some of which you may not even be able to run yourself.


Are you talking about NoSQL databases, like Mongo and Redis?


I think NewSQL refers to database systems that still enforce ACID transactions, but use various strategies to avoid the scalability difficulties that have traditionally been associated with RDBMS as both consistent and available systems.

https://en.wikipedia.org/wiki/NewSQL


Apparently I should stop using gitlab.com. By choosing not to shard your data you have basically put a cap on scalability... or at least one that will be reached faster than if you had decided to shard.


Web service architectures are like shoes. You want what fits now, not what you hope will fit later.

Chances are very good that preemptive scaling will just result in unnecessary development overhead and an architecture/set of abstractions that still fail to handle the expected load since they weren't battle tested against it.


Following that same logic you probably should stop using a lot of websites out there.


I am a bit disappointed by this article and the hierarchy of how to optimise Postgres. As a MySQL specialist who has worked on a few Postgres projects, I can tell you that Postgres has a VERY RICH toolkit to speed things up. Between optimising queries and load balancing, you have indexing, partitioning (Postgres even allows you to index EACH partition differently), and triggers to build summary/rollup tables for you.
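
To give a flavour of it, here's roughly what the partitioning piece looks like with the declarative partitioning that landed in Postgres 10 (table names invented; the point is that each partition can carry its own indexes):

    CREATE TABLE events (
        user_id    bigint      NOT NULL,
        created_at timestamptz NOT NULL,
        payload    jsonb
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2017_10 PARTITION OF events
        FOR VALUES FROM ('2017-10-01') TO ('2017-11-01');

    -- Only the partition that needs it gets the expensive index:
    CREATE INDEX ON events_2017_10 USING gin (payload);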

It's just that Postgres may need a little more coding and is less automatic than MySQL, so people shy away from it.

I do like the time and information spent on spreading load across database servers; not nearly enough people consider splitting reads and writes across a master/slave setup.

It's just frustrating to me that engineers view the database as some sort of mysterious black box that you need to work around or replace entirely.


> 2. Use a connection pooler ...

My jaw dropped at that point -- they didn't use connection pooling? WTF? I've seen connection pools on pretty much every project I've worked on, even under 1 req/day load; it's been pretty much a "must have" since the 90s. And it's really just a library (like dbcp2) and a few lines of config, with almost no drawbacks.

I cannot believe engineers at GitLab's level are so bad at, well, engineering. I'm losing faith in their product, please show me that I'm wrong.


We've had connection pooling on the client side since forever, just like any other Rails project. This article specifically talks about server side pooling.


Why are they bad at engineering just for not using a feature that you were aware of? If anything GitLab has proven to be reasonably good at engineering, given they've made it this far with a single Postgres instance with no tweaking. It also speaks volumes for PG itself.


> A side effect of using transaction pooling is that you cannot use prepared statements, as the PREPARE and EXECUTE commands may end up running in different connections; producing errors as a result.

Uhhh... WHAT? I'll simply sit by and watch for SQL injection vulnerabilities resulting from this change. Either you have a very good SQL query writer engine with rock solid escaping or you will get pwned by this.


You can have perfectly valid input escaping without needing prepared statements. Many ORMs handle this without the need for PREPARE/EXECUTE.


> You can have perfectly valid input escaping without needing prepared statements.

People have bypassed this far too often for my taste, and there is a reason why SQL (and other) injections are #1 in the OWASP Top 10 (1st place in both RC1 and RC2 of 2017, and also in the 2013 edition).

Never, ever trust any form of escaping or it will bite you. Hard.


> Never, ever trust any form of escaping or it will bite you. Hard.

This is an absurd dogmatic statement. Everybody trusts a multitude of escapings everywhere every day all over the web without even knowing it.


How much do you trust the postgres developers?

https://www.postgresql.org/docs/9.1/static/libpq-exec.html#L...


What is transaction pooling anyway? A transaction exists for a specific action; it shouldn't be shared among users or multiple connections. It sounds like they've taken transactions and removed the isolation.

I don't know what the data access is like, but I get the impression that it's like some "enterprise" patterns I've seen where data access is too "encapsulated", i.e. each DAL opens its own connection and pulls its chunk of data, and each request involves many of these, instead of having each request open the connection and pass it down to the data access layers.


"transaction-level" pooling might be a more apt description. Instead of assigning each incoming connection to a dedication upstream connection for the entire duration of the incoming connection, it assigns the upstream connections on a per-transaction basis. When each transaction ends, the upstream connection is returned to the pool. A better description is at https://wiki.postgresql.org/wiki/PgBouncer.


In Postgres, beyond the SQL PREPARE commands, there is also a protocol for executing prepared statements. Their client drivers almost certainly use this protocol, which would mean the normal placeholder safety applies. I think this is just a misunderstanding of their point, and not a full backtracking to string escaping.



