
Gitlab.com performance degradation: Postgresql split – Only one DB node active - vidarh
https://docs.google.com/document/d/1_IzyO-jwqb7UFl0A28D1gR4EaU99cEnoUSD9o8q4eZw/preview
======
vidarh
I submitted this mostly because their decision to publish this document
during the incident (still ongoing as of this writing) makes it particularly
interesting. A lot of places wouldn't give this level of detail _after_ a
problem, much less during...

~~~
siganakis
It's great to see they are so open about their issues, but as a paying
customer it's a real shame how frequently they seem to suffer from outages
and performance degradation.

I love the product, but hate that I can't rely on it.

~~~
vidarh
I share the sentiment. They do have some lessons to learn... One being "test
how long fully restoring databases from catastrophic scenarios will take, and
consider whether you can afford it to take that long; if not, take corrective
action". People constantly seem to forget just how slow restores or syncing a
new database replica can get, or assume they'll never need to do it.
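
To make that concrete: even a crude drill like the one below (hostnames,
users and paths here are made up) tells you roughly where you stand:

    # Time a full replica resync from the primary via streaming base backup
    time pg_basebackup -h db-primary.example -U replicator \
        -D /var/lib/postgresql/restore-drill -X stream -P

    # Or time restoring the latest logical dump into a scratch database,
    # using 4 parallel jobs
    time pg_restore -j 4 -d scratchdb /backups/latest.dump

Run something like that periodically against production-sized data and the
wall-clock time of a real restore will never surprise you.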

~~~
zzzcpan
Well, you can't really solve the high-availability problem by testing how
long replication and restoration take. It's a lot more nuanced than that and
requires real expertise in distributed systems. Usually it means getting rid
of PostgreSQL/MySQL completely in favor of distributed solutions, as that's
cheaper and a better investment in the infrastructure than attempting to
build high availability on top of them.

~~~
vidarh
You can get high-enough availability just fine on Postgres. Very few
applications require zero downtime. With pgbouncer or similar in front, you
can generally flip to a slave with very minimal impact. The issue comes in
situations like this one, where a mistake leaves you without up-to-date
slaves and your system can't handle the read load on a single server.
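
As a sketch of what I mean (PAUSE/RELOAD/RESUME are real pgbouncer admin
commands; the hostnames and database name are hypothetical): clients keep
connecting to pgbouncer on port 6432 while you repoint the backend entry in
pgbouncer.ini at the promoted slave:

    # pgbouncer.ini, [databases] section, originally:
    #   gitlab = host=db-primary.example port=5432
    # After promoting the slave, change it to:
    #   gitlab = host=db-slave-1.example port=5432

    # Pause traffic, reload the config, resume -- assuming the 'pgbouncer'
    # user is listed in admin_users:
    psql -p 6432 -U pgbouncer pgbouncer -c 'PAUSE;'
    psql -p 6432 -U pgbouncer pgbouncer -c 'RELOAD;'
    psql -p 6432 -U pgbouncer pgbouncer -c 'RESUME;'

Clients see a brief stall instead of dropped connections, which is most of
what "very minimal impact" buys you.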

I agree with you in principle, but for most systems it's total overkill. It
wouldn't be if distributed solutions were easy to set up and free of
tradeoffs, but we're nowhere near being there.

In most cases, then, restoration time is the biggest barrier to getting
"high-enough" availability without re-engineering everything for a totally
different system. Often you can prevent that from becoming an issue by
siloing functionality into separate databases, offloading logs and analytics
for example. Or by buying faster SSDs for your DB servers... There are many
approaches depending on the size of your dataset, and most people never
outgrow those options.

To put it this way: Gitlab.com's database is small enough that fitting it in
RAM on a commodity server is easily doable. While they'd still need to have
snapshots on disk, at that point beating the restore speeds they're reporting
would be trivial.
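
For scale: checking how big your own database is takes one line
(pg_size_pretty and pg_database_size are stock Postgres functions):

    # Report the on-disk size of the database you're connected to
    psql -c "SELECT pg_size_pretty(pg_database_size(current_database()));"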

------
usr1106
While the document might reveal that in some respects they were not well
prepared, I must say that opening up the document increases my trust rather
than reducing it. I'm sure other companies have their risks and poor
practices, too. Why should I trust them more for not communicating?
(Marketing BS is not communication. Communication requires a receiver, and I
typically refuse to mentally receive it, at least not without reading heavily
between the lines.)

------
al2o3cr
In case you've got a flaky memory like I do, this is a different incident
from the one last year:

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

