
Gitlab 500 error - sbhn
https://gitlab.com
======
GrayShade
I always found it disheartening to read their changelogs and see how every
release (I've looked at) fixes some "N+1 SELECTs" problem [1].

This will be an unpopular opinion here, but I feel we've (collectively)
failed. Between ORMs and the dynamic language _du jour_, we have become so
far removed from how computers work that most software development is a waste
of electricity. We're using tens of servers and complex architectures for
simple sites, just because we think that's the only way to scale them [2]. And
our computers are slower than they ever were [3] [4].

I only gave a couple of examples, but look around, you can find them
everywhere. Everything is awful.

[1] The issue tracker is now down, but you can try it when it comes back:
[https://about.gitlab.com/releases/](https://about.gitlab.com/releases/).

[2] [https://www.usenix.org/system/files/conference/hotos15/hotos...](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf)

[3] On older computers, GNOME can't even keep up with updating the mouse cursor.

[4] I'm honestly not sure that Windows 10 boots faster from an SSD than XP
from an HDD.

~~~
adtac
This is why I have a mild dislike for ORMs. They're pretty handy, but most of
the time, I prefer to hand-write SQL statements. Are they more work? Yes. But
_no_ ORM will ever give me the actual flexibility and performance of raw SQL
queries (unless it becomes SQL itself).
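
For anyone unfamiliar with the pattern being complained about above, here is a
rough, hypothetical sketch (table and column names invented for illustration)
of what an "N+1 SELECTs" access pattern looks like next to a single
hand-written query:

```sql
-- What a naive ORM loop tends to emit: one query for the parent rows,
-- then one extra query per row (the "+1" for each of the N projects).
SELECT id, name FROM projects WHERE namespace_id = 42;
SELECT * FROM issues WHERE project_id = 1;
SELECT * FROM issues WHERE project_id = 2;
-- ...one more SELECT for every remaining project...

-- The hand-written equivalent: a single round trip using a JOIN.
SELECT p.id, p.name, i.*
FROM projects AS p
LEFT JOIN issues AS i ON i.project_id = p.id
WHERE p.namespace_id = 42;
```

Most ORMs can be told to eager-load the association and will then generate
something close to the second form; the N+1 shape usually shows up when that
hint is forgotten.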

~~~
mdpopescu
Not a bad point. Then again, once you hit your first 1200-line stored
procedure, full of logical bugs including SQL injections, you'll realize SQL
isn't that amazing either :)

~~~
dimgl
If your stored procedure has 1200 lines, you're doing something wrong.

~~~
DisposableMike
How about recreating a report to feed into a legacy system from an ERP
installation with 5,000 tables and 12,000 stored procedures, none of which
you're allowed to modify without losing support?

------
jedberg
I gotta say, I love but am surprised by the public Google Doc of the outages.
If I worked there I would personally advocate for not doing that, only because
I feel like it would slow down resolution if I had to constantly worry about
not putting "secret" things into the doc. Unless they have someone whose job
is to take their real chat and transcribe it, in which case that's great,
because that person is useful for a postmortem as well.

Speaking of postmortems, reading through the Google Doc of their outage
timeline, my first postmortem question would be why is the website dependent
on the database being up? Shouldn't there at least be a fallback that says
"We're offline right now, sorry" when the database is unavailable? Or, in the
name of extreme transparency, have it frame or show a link to the Google Doc
of the outage?

My second question would be how often do we test failures of a read slave, of
the master, and of pgbouncer? Do we at least have a practiced procedure for
manual failover if not an automatic failover?

My third question would be what sort of alerts do we have, if any, on
replication lag?

It's been a while since I've had to run a major outage, so this was an
interesting exercise for me!

~~~
YorickPeterse
> Speaking of postmortems, reading through the Google Doc of their outage
> timeline, my first postmortem question would be why is the website dependent
> on the database being up?

You can't do anything with GitLab without the database being available.

> Shouldn't there at least be a fallback that says "We're offline right now
> sorry" when the database is unavailable? Or in the name of extreme
> transparency, have it frame or show a link to the google doc of the outage?

We have a deploy page, but it requires us to explicitly enable it, which can
be easy to forget. I do think adding a link to our Twitter status to the error
pages could be useful, or maybe a widget of some kind.

> My second question would be how often do we test failures of a read slave,
> of the master, and of pgbouncer?

In the past we tested certain scenarios whenever necessary. For example, when
introducing new parts of the database load balancer we'd test what happens if
we terminate X or Y. However, we do not yet do periodic chaos monkey style
testing, which has been on our todo list for quite a while.

> Do we at least have a practiced procedure for manual failover if not an
> automatic failover?

Failovers are actively worked on
([https://gitlab.com/gitlab-com/database/issues/91](https://gitlab.com/gitlab-com/database/issues/91),
[https://gitlab.com/gitlab-com/database/issues/86](https://gitlab.com/gitlab-com/database/issues/86),
[https://gitlab.com/gitlab-com/database/issues/40](https://gitlab.com/gitlab-com/database/issues/40)).
Right now our failover procedure is still quite painful, and it's easy to make
mistakes. We held off using this approach for our maintenance because of these
reasons, and also because a failover usually results in 2-3 hours of slow
loading times. Ironically we ended up with just that.

> My third question would be what sort of alerts do we have, if any, on
> replication lag?

Plenty:
[https://gitlab.com/gitlab-com/runbooks/blob/810824765ec45534...](https://gitlab.com/gitlab-com/runbooks/blob/810824765ec455344bdae35300a352be8c147dbb/alerts/postgresqls.yml)

In this particular instance they go to a Slack channel called "#database", and
this channel has been on fire with alerts since the problem started.
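
For context on what such alerts typically watch, here is a minimal sketch (not
the query from GitLab's runbook linked above) of how replication lag can be
measured in PostgreSQL; an alert then simply fires when the value stays above
a threshold:

```sql
-- Run on a standby: estimate how far the replica is behind, based on the
-- timestamp of the last transaction it has replayed.
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;

-- Run on the primary: one row per connected replica with its current state
-- and WAL positions (column names vary slightly between Postgres versions).
SELECT * FROM pg_stat_replication;
```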

~~~
jedberg
Wow sweet, I didn't actually expect a response! Nice work.

> We have a deploy page, but it requires us to explicitly enable it, which can
> be easy to forget. I do think adding a link to our Twitter status to the
> error pages could be useful, or maybe a widget of some kind.

Or make it automatic when the database isn't available?

> In this particular instance they go to a Slack channel called "#database",
> and this channel has been on fire with alerts since the problem started.

Good to know. My next question would be "were there any signs of impending
doom before the event happened that could have warned us, and do we have
alerts on that?"

~~~
YorickPeterse
> Or make it automatic when the database isn't available?

I agree this is probably a good idea, so I created
[https://gitlab.com/gitlab-com/infrastructure/issues/4382](https://gitlab.com/gitlab-com/infrastructure/issues/4382)
to keep track of things.

> were there any signs of impending doom before the event happened that could
> have warned us, and do we have alerts on that?

I don't think you can detect corrupt WAL until it is actually corrupt. In
that case the host won't start up, and we already have various alerts that
cover that scenario (e.g. there's an alert for when too few transactions are
happening).

There also wasn't really any impending doom before we started things. Two
issues that triggered this were:

1. A lack of `set -e`.

2. Apparently, when we stop a command using `gitlab-ctl`, it will time out if
the process doesn't stop fast enough. I'm not sure what exit status we report
in that case, but in this instance it resulted in the script continuing to
run.
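
As a minimal, hypothetical sketch (not GitLab's actual maintenance script) of
why the missing `set -e` matters: without it, a failed or timed-out stop
command does not abort the script, so the steps after it still run.

```bash
#!/usr/bin/env bash
# Without the next two lines, a non-zero exit status from any command is
# ignored and execution simply continues with the following line.
set -e           # abort the script as soon as any command fails
set -o pipefail  # also fail when a command inside a pipeline fails

# Hypothetical maintenance steps, for illustration only:
gitlab-ctl stop postgresql      # if this times out and exits non-zero...
rm -rf /var/opt/backups/stale   # ...this must not run; `set -e` stops us first
```

This only helps, of course, if the stop command actually reports a non-zero
exit status when it times out, which is exactly the uncertainty mentioned
above.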

------
sbuttgereit
I have growing concerns over GitLab. I appreciate the openness and
forthrightness of their handling of these sorts of incidents and in regard to
other issues, but there comes a point where none of that matters if the
product itself is inherently unreliable. This post isn't about any one
incident, including this one, but about an ongoing trend that I don't get the
sense is improving.

To be completely fair, I'm on the free plans and have no rightful expectation
of any sort of performance level. Having said that, I have considered moving
to paid tiers, and I do advise others in regard to these sorts of services
(whether cloud-based or on-premise). Every time I see a planned maintenance of
"under 1m", I simply realize that I shouldn't plan anything important around
GitLab that day. I can't see that this would be any different with GitLab.com
paid plans, and I have to imagine there is something inherently difficult
about managing the software if the developer of that software has issues with
common maintenance; this colors my impression of what it might be like
in-house. It seems a beast. I get that there are scaling issues that are
different between something like gitlab.com and GitLab on-premise, but the
history of these things is coloring my impression.

At some point moving fast and breaking things needs to give way to the pursuit
of enough stability that users aren't overly concerned about whether or not
the tools they depend on, or more importantly the data they're storing, are
regularly at risk.

I don't want to buy services/products from a company that is merely
trustworthy and open when things go wrong. I want to buy services/products
from companies that only rarely need to prove those qualities.

~~~
jazoom
It always seems to be either their storage abstraction (Ceph?) or Postgres.
I've never used Postgres but from what I see it's not really designed for
enormous scale (it's a very old project). Perhaps they would be well served to
see if CockroachDB gives them more stability. I've started using it at a small
scale and the clustering aspect seems legit.

~~~
sbuttgereit
I work with PostgreSQL quite a lot and I don't think this is the case.

Yes, PostgreSQL is a mature project. But that by itself has little or no
bearing on how well it can scale; Linux is just about as old, yet it still
drives most of the infrastructure we're talking about. For context, PostgreSQL
has been under continuous development since its inception, and never more so
than in the past decade or so, with substantial community and corporate
backing. To conflate project age with current robustness is to indulge in
fallacious thinking (in this case "cum hoc ergo propter hoc", if I'm not
mistaken). The project has advanced with the times. There are good examples of
PostgreSQL running at scale, at companies such as Instagram, Skype (up to at
least the point of the Microsoft acquisition), Pandora, and others. Most of
these, I'd bet, are/were using PostgreSQL at scales substantially larger than
what GitLab faces.

Relational databases require good management and good design to function
properly. There were people who specialized in this, called database
administrators (at least for the management piece, anyway). My feeling
(perhaps unjustified) is that in start-upish environments there is a tendency
to minimize DBA expertise in favor of having more conventional developers who
can "get the database to work"... which is a different standard than "getting
the database to perform". That's mostly not my world, so I may be jumping to
conclusions (I'm an enterprise systems guy on most days).

I think GitLab probably has a complex software product and lacks the correct
expertise in infrastructure (including, yes, DBAs) for their SaaS offering.
I'm reading tea leaves, but that would be a perfect storm which would produce
exactly what we see no matter how robust any one piece of this puzzle might
be.

~~~
jazoom
> Yes, PostgreSQL is a mature project. But that has little or no bearing by
> itself on what degree it can scale; Linux is just about as old, yet it still
> drives most of the infrastructure we're talking about.

I knew someone here would misinterpret why I mentioned its age. As I said,
it's not what it was designed for. Perhaps it has bolted on things in the 20+
years since it was first conceived, but it wasn't purpose-built for scale.

> To conflate project age with current robustness is to indulge in fallacious
> thinking

It only makes sense to talk about "robustness" in the context something was
designed to work in. I can't abuse a system and call it unrobust. Of course
Postgres is robust. You're twisting what I was saying.

> There are good examples of PostgreSQL running at scale, including at
> companies operating at scale such as Instagram

I'd actually be keen to learn more about how Postgres works at Instagram scale
without their own customisations to make it work in that setting.

If Postgres works well in that setting why do things like this exist?
[https://github.com/citusdata/citus](https://github.com/citusdata/citus)

------
dpcx
Gitlab working document:
[https://docs.google.com/document/d/1WmOMKq63Rap2wfQ-yy7sHyoc...](https://docs.google.com/document/d/1WmOMKq63Rap2wfQ-yy7sHyocFvmbJgIt1UIj2xEXEco/preview)

~~~
lunixbochs
> Bash scripts did not have set -x so continued to run even after the stop
> command had failed.

I hope that's a typo in transcription and they actually used `set -e` as the
fix (and maybe `set -o pipefail`); `set -x` does something entirely different.

~~~
YorickPeterse
Yes, that's a typo; it should have been `set -e`.

------
KaiserPro
The hosted product, whilst far more fully featured than GitHub, is horridly
unstable.

It's got to the point where I am seriously considering moving our repos (~250)
to a self-hosted CE version.

Looking at the outage reports doesn't fill me with confidence: a lack of
redundancy, no systematic testing of backups, a seeming lack of QA.

GitLab has an outage on average once every two weeks, compared to GitHub's
once every 6 _months_ (look at the status pages).

------
sergiotapia
Please send some love to the engineer responsible. I've been on the causing
end of a 500 error, and even though I fixed it in 10 minutes, it's never fun!

------
cyphar
It's a little bit concerning to note that the last GitLab outage was also
caused by Postgres replication issues, which eventually culminated in some
pretty bad data loss and a realisation that the most recent backups they had
weren't usable. Maybe they should reach out to someone from Postgres to help
figure out why their replication appears to have some sort of systemic
issue...

------
winslow
I'm also seeing a 5xx server error on Instagram.

Edit: ~15 mins later, Instagram seems to be back up.

Edit #2: Seems to be back to a 5xx server error; it's been ~1 hr now.

~~~
LusoTycoon
still seeing it

------
LoneWolf
Curious: Stack Overflow also has a 500. Could they be related?

------
dubcanada
They do database maintenance at 11am EST on a Tuesday? Seems like a terrible
time to do database maintenance?

------
thinkindie
They probably need stronger arguments than "Micro$oft is really bad" if they
want to have an edge over GitHub. The solidity of a professional tool matters
far more than the transparency of the company's processes.

------
geekrax
So now that the outage is over, this post has lost its significance, and
there's no way to see what was happening during the outage. Does anyone have a
cached version of the page or a screenshot for some context?

------
dustfinger
Yes, I can confirm the same. They are both returning 500 server errors. That
is quite a coincidence.

------
txmjs
Had quite some trouble pushing this afternoon. Hope it gets resolved soon.

------
dustfinger
Stack Overflow is up now. GitLab is still down.

------
philip1209
Time for another outage t-shirt?

------
MrFurious
GitLab with problems? Again?

------
yathomasi1
Wow, back at the right moment.

------
dustfinger
Stack Overflow was actually returning a 503, not a 500.

------
phonon
Should have used AWS Aurora Postgres

------
dustfinger
> [https://outage.report/stackoverflow](https://outage.report/stackoverflow)

------
sbhn
Excellent, my fork-me-on-gitlab project is viewable again.

[https://gitlab.com/seanwasere/fork-me-on-gitlab](https://gitlab.com/seanwasere/fork-me-on-gitlab)

