Gitlab 500 error (gitlab.com)
120 points by sbhn 6 days ago | 116 comments





I always found it disheartening to read their changelogs and see how every release (I've looked at) fixes some "N+1 SELECTs" problem [1].

This will be an unpopular opinion here, but I feel we've (collectively) failed. Between ORMs and the dynamic language du jour, we have become so far removed from how computers work that most software development is a waste of electricity. We're using tens of servers and complex architectures for simple sites, just because we think that's the only way to scale them [2]. And our computers are slower than they ever were [3] [4].

I only gave a couple of examples, but look around, you can find them everywhere. Everything is awful.

[1] The issue tracker is now down, but you can try it when it comes back: https://about.gitlab.com/releases/.

[2] https://www.usenix.org/system/files/conference/hotos15/hotos...

[3] On older computers GNOME can't even keep up to update the mouse cursor.

[4] I'm honestly not sure that Windows 10 boots faster from an SSD than XP from an HDD.
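To make the "N+1 SELECTs" pattern concrete, here is a minimal sketch in Python with sqlite3 (the schema and table names are invented for the demo):

```python
import sqlite3

# In-memory demo of the "N+1 SELECTs" pattern (schema invented for the demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE issues (id INTEGER PRIMARY KEY, project_id INTEGER, title TEXT);
    INSERT INTO projects VALUES (1, 'alpha'), (2, 'beta');
    INSERT INTO issues VALUES (1, 1, 'bug'), (2, 1, 'feature'), (3, 2, 'task');
""")

# N+1: one query for the parents, then one more query per parent. This is
# what a naive ORM relationship traversal silently generates.
projects = conn.execute("SELECT id, name FROM projects").fetchall()
for pid, name in projects:
    issues = conn.execute(
        "SELECT title FROM issues WHERE project_id = ?", (pid,)
    ).fetchall()  # N extra round trips for N projects

# The fix: one JOIN (or the ORM's eager-loading equivalent, e.g.
# select_related/prefetch_related in Django, includes in Rails).
rows = conn.execute("""
    SELECT p.name, i.title
    FROM projects p JOIN issues i ON i.project_id = p.id
""").fetchall()
```

The loop is exactly what an ORM emits when you lazily touch a child collection per parent; eager loading collapses it to a constant number of queries.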


I really think you're on to something here. At my place of work I see added complexity celebrated for its own sake: overcomplicated pipelines, "Big Data" for questionable reasons, etc. I'm becoming more and more sure this is simply that people don't actually know how any of it works.

My mantra is that everything can and will fail, and every layer you add is another thing that will fail, so your pipeline just becomes extremely brittle. And while there's nothing more frustrating than a procedural dump of linear code, at least debugging it is possible.

With these overly complex systems, your layers of abstraction just start fighting you, and often they'll be on different servers, etc. All of this is hard, and if it's not necessary, don't jump there.


In the tech industry, our ability to obtain new tools and find new problems to solve vastly outstrips the speed at which we can properly understand either the tools or the problems. When relevance in an organization is tied to your ability to leverage the new sexy, of course the equilibrium behavior is to be half-cocked 100% of the time.

I think it's a cargo cult problem. We see massive companies showing off their solutions to gigantic engineering problems that can only be solved in a certain way, and we think that we also must solve problems that way. Or that we'll get promoted if we solve problems "like <insert-big-name-tech-company>" in our architecture.

Chances are, all you need is to properly use a relational database or another similar, proven, highly flexible and optimized tools, and you'll be fine until you have the time and money to solve the scale issues.


I think it might go deeper than just the engineers, although we are definitely a large part of it. The basic issue is that most companies don't compensate employees for competently building software.

Technical debt is always a future manager's problem, so it's never worked on, and preventing it is never seen as a big accomplishment when evaluating an engineer.

Secondly, managers are people, and people like flashy things. When someone says they built the website in vanilla JS because it works and they won't need anything more complicated for years, the manager isn't going to get that excited, or remember it when raise time comes around. If an engineer says they rewrote an application from [legacy language] into some hip new language that the manager has heard about from peers at other companies (who are also rewriting their applications), it's going to stick in the manager's head when they rate their employees.

You get what you pay for, and modern companies are paying for semi-working but flashy software more than they pay for solid but boring software.


I think the better way to put it is that "we" stopped caring about performance, both on the client (think Electron) and on the server (ORMs & dynamic languages, as already mentioned), at some point, in favour of faster, cheaper time to market. Just so that the output is "good enough".

But honestly, I agree with that approach to a certain (!) extent. I don't want to write all of my backend in C++. I use Java or Python instead. Thousand reasons: maven/pip, portability etc.

So, I would say we need to come to terms with ourselves and agree that the productivity is sometimes increased at the expense of the system quality. After that, I think it's okay to keep doing that if both us and our customers are OK with the tradeoff.

Just to illustrate: do I like the RAM consumption & the startup time of my Slack app? No. But I am a free user, so I can't complain. In the end, the Slack team probably made the right choice.

UPD: 500 error is gone. Gitlab also needs to manage costs and be friendly to the OSS contributors. Ruby was also probably a good choice for them. If you want top-notch software, be prepared to stomach NASA-grade costs, because the rumour is that 1 LOC in Shuttle ended up costing $1500 [1].

P.S. Thanks a lot for the [2] paper reference!

[1] https://www.johndcook.com/blog/2009/10/08/nasa-buggy-softwar...


> I don't want to write all of my backend in C++.

The difficulty of C++ is wildly overrated. Most run-of-the-mill projects can be written with the basics of C++ (which anyone can learn in less than 3 months).

We could be living in a far more performant world if people stopped scaring beginners with the difficulties of C++/Rust/...

Go talk to a beginner: the moment you say anything about C/C++, the first thing they mention is "what do you do about buffer overflows?" They don't even know what one is, but they are worried about it.


> We could be living in a far more performant world if people stopped scaring beginners with the difficulties of C++/Rust/...

You can't get around the fact that languages like Python and JavaScript are better suited for beginners. From their perspective, it should be horrifying. At least until they properly learn it.


I do agree. But the trend/encouragement should be toward informed programming (how the OS works, how the DB works, how file systems work, ...). What we see now is exactly the opposite: programmers are becoming more ignorant day after day. Even worse, people get credit for running away from C/C++/OCaml/Rust, ...

The first thing I would mention when talking about C is a hash table. But yes, you are right in general. However, my plan for this summer is to learn Clojure because I believe that learning macros will serve me better right now [1].

[1]: http://www.paulgraham.com/avg.html


> I don't want to write all of my backend in C++. I use Java or Python instead. Thousand reasons: maven/pip, portability etc.

Is this really a good excuse with containers being used so frequently?


Maven/pip is still a good excuse. I don't want to wait x months until a new release of some library lands in Debian repos.

Portability, not so much; but then again, it saves you so much headache because switching from 16.04 to 18.04 should not break anything in your system.


> rumour is that 1 LOC in Shuttle ended up costing $1500 [1].

Should have used K&R braces


Why? K&R braces start and terminate on their own lines, which would push up the LOC count.

Thus reducing the cost per LOC :D

Ahhh, I misinterpreted :)

> This will be an unpopular opinion here, but I feel we've (collectively) failed. Between ORMs and the dynamic language du jour, we have become so far removed from how computers work that most software development is a waste of electricity.

I just wanted to say that I was kinda relieved reading that, because I often feel like I’m the only person who thinks that.


you guys need to read this blog https://mechanical-sympathy.blogspot.com/

I don't know what you mean; I've just been working on 3 Node lambdas, deployed by CircleCI to AWS infrastructure provisioned by terraform which take a CSV from an email and output multiple CSVs.

Hey, at least you know the system will probably scale when you have to deal with a million emails and ten million outputs!

This is why I have a mild dislike for ORMs. They're pretty handy, but most of the time I prefer to hand-write my SQL statements. Is that more work? Yes. But no ORM will ever give me the actual flexibility and performance of raw SQL queries (unless it becomes SQL itself).

In a shop I used to work at, all of the devs were looking into Entity Framework as an abstraction layer for access to an MSSQL database. They were using T4 generators to create classes based off of the tables they were interested in (there were hundreds), and so they were looking into unit of work/repository patterns.

I told them I had plenty of experience going down this route (my most popular StackOverflow question is on UOW/repo) and to just use Dapper and not look back. There is a pleasant simplicity in using plain old C# classes and simple queries that you can visualize just by looking at them without having to run a trace to figure out what's actually being executed.

This was one battle of many. I lost that battle.


You'll love this: https://www.earthli.com/news/view_article.php?id=2288

> This was one battle of many. I lost that battle.

Have a hug :-).


If you take an inexperienced dev and have them hand-write SQL, it is not guaranteed to be very good either.

Either way, ORM or raw SQL, you have to understand the underlying database technology.

The problem is lack of skills not tech.


I agree, and also think the tools we use shape us (and vice-versa).

It may be a tradeoff between lower standards (e.g. "Can you use an ORM?") and steeper learning curve before someone can produce useful work. The former optimizes for more bodies that meet a certain criteria, the latter may optimize for a higher skill level on average.

There are many other factors that contribute, but it's interesting to think about. (I know, for example, that the Navy SEALs lowered their standards a while back to increase enrollment... and some friends have said it's had a negative impact on the teams, because they have taken some people with behavioral profiles that would never have made it in the past. But they had more missions due to the changing nature of warfare, and needed the numbers.)


Maybe it's perception. I don't view ORM as a tool to make working with a database easier. In fact, it's harder to use but can have real advantages. For example, it would be nice to use raw SQL in our app but we have lots of reporting and data analysis functionality with support for custom grouping, filtering, etc... Composing queries on the fly with an ORM is trivial. On the other hand composing SQL strings or writing out manually the thousands of potential combinations would be very painful and brittle.

I also find ORMs are great for transactional updates where you are only modifying a few entities at a time. They really reduce the amount of boilerplate required (reducing errors) and allow you to use refactoring tools (find usages, etc..) to locate where a particular column is used in the code base.

So to me it's not about making SQL easier, because let's face it, you have a complex translation layer there that requires full understanding. Rather, it's about capturing those clear benefits.
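The composition argument can be sketched without any ORM at all. The hypothetical helper below mimics what an ORM's chainable filter calls do for you; every table and column name is invented for illustration:

```python
import sqlite3

def build_report_query(filters, group_by=None):
    """Compose a query from optional, user-selected filters: a rough
    stand-in for what an ORM's chainable .filter() calls do for you.
    Table and column names here are hypothetical."""
    where, params = [], []
    if "min_age" in filters:
        where.append("age >= ?")
        params.append(filters["min_age"])
    if "country" in filters:
        where.append("country = ?")
        params.append(filters["country"])
    sql = "SELECT country, COUNT(*) FROM users"
    if where:
        sql += " WHERE " + " AND ".join(where)
    if group_by:
        sql += f" GROUP BY {group_by}"  # group_by must be whitelisted in real code
    return sql, params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER, country TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [("a", 20, "US"), ("b", 30, "US"), ("c", 40, "DE")])

sql, params = build_report_query({"min_age": 25}, group_by="country")
result = conn.execute(sql, params).fetchall()
```

Hand-rolling this for a handful of filters is fine; with dozens of groupings and filter combinations it becomes exactly the brittle string-assembly the parent describes, which is where an ORM's expression layer earns its keep.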


The day-to-day is certainly up for debate -- I prefer SQL myself. But imagine the initial learning curve for someone fairly new to programming. Persisting data to, and reading data from, a "database" is a lot easier in code than learning SQL (which frequently involves a client-server system, further increasing the surface area of knowledge required to produce working results).

The difficulty required to think about a solution increases significantly with additional components.


I prefer the control of writing SQL too. However, you have to be careful of SQL injection attacks.

Honestly, if you are developing a web app, once you have it working, part of the routine should be to look at the SQL queries for each operation and spot any N+1 issues. I use Django, so this is easy to do with the Django Debug Toolbar. I'm sure similar tooling exists for other frameworks as well.
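Outside Django, the same "count the queries per operation" check can be approximated with sqlite3's trace hook. This is only a sketch of the idea, not how Debug Toolbar is implemented:

```python
import sqlite3

# Count the statements an operation issues, the same idea as Django
# Debug Toolbar's SQL panel, using sqlite3's trace hook.
conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit: no implicit BEGINs
executed = []
conn.set_trace_callback(executed.append)

conn.execute("CREATE TABLE t (x INTEGER)")
for i in range(5):
    conn.execute("INSERT INTO t VALUES (?)", (i,))

# If a single page view logs N near-identical statements like these,
# you are almost certainly looking at an N+1 issue.
inserts = [s for s in executed if s.startswith("INSERT")]
```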


> you have to be careful of SQL injection attacks.

Surely you're using prepared statements?!


I've string-interpolated SQL queries before... I know.

I'll only do this for internal apps, not for anything on the public internet. This is not a good practice.
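For the record, a minimal demonstration of why parameterization matters, using Python's sqlite3 and a toy table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

user_input = "' OR '1'='1"  # classic injection payload

# String interpolation: the payload becomes part of the SQL itself,
# so the WHERE clause matches every row.
unsafe = conn.execute(
    f"SELECT count(*) FROM users WHERE name = '{user_input}'"
).fetchone()[0]

# Parameterized: the driver treats the payload as a literal value,
# which matches no row.
safe = conn.execute(
    "SELECT count(*) FROM users WHERE name = ?", (user_input,)
).fetchone()[0]
```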


I agree. I've moved toward raw SQL in almost every app I build.

That said, ORMs are far from equal. Some are much better and closer to SQL than others. I've been impressed with ActiveRecord in that regard, and also Ecto.


Elixir's Ecto is probably the most impressive ORM I've ever used. It absolutely blew my mind when I used it.

I wrote an ORM once, back in the mid-2000s. One of my principles was that when I know a damn good SQL query that will get me what I want, the ORM better not get in the way.

The result? My query generating classes would happily take in individual parts of an SQL statement, or a whole query, handle it through the same database abstractions as a generated one, and give me back workable objects as best as possible.

This came as a result of seeing what miserable performance I got handling joins as application logic, recursing through things one at a time when the correct query would bring me back every piece of information I needed at once.

I still use some of that old code from time to time, mostly because I miss how well it meshed with the way I think about things. Sometimes writing your own, if you've got the time, is a nice way to become truly productive and comfortable.


> But no ORM will ever give me the actual flexibility and performance of raw SQL queries (unless it becomes SQL itself).

Enter SQLAlchemy Expressions.

The reason why that's actually an excellent idea is twofold: For one, the entity-mapping part (i.e. you can write very complex queries and still get your mapped objects back), the other is that you're essentially writing the AST of your SQL queries using real objects, so it's very easy to manipulate and reuse, which is quite hard to do with textual SQL. (Also it's easy to mix textual SQL and SQL expressions in SQLAlchemy; but the need to do that is very, very rare indeed).
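The "query as an object tree" idea is worth spelling out. The sketch below is emphatically not SQLAlchemy's real API, just a toy showing why trees of expression objects compose and can be reused in ways that concatenated SQL strings cannot:

```python
# Toy illustration of the "query as an object tree" idea. NOT SQLAlchemy's
# API; every class here is invented for the sketch.
class Cmp:
    """A single comparison, e.g. age >= 18."""
    def __init__(self, col, op, val):
        self.col, self.op, self.val = col, op, val
    def __and__(self, other):
        return And(self, other)
    def compile(self):
        return f"{self.col} {self.op} ?", [self.val]

class And:
    """A conjunction node in the expression tree."""
    def __init__(self, left, right):
        self.left, self.right = left, right
    def __and__(self, other):
        return And(self, other)
    def compile(self):
        lsql, lparams = self.left.compile()
        rsql, rparams = self.right.compile()
        return f"({lsql} AND {rsql})", lparams + rparams

# The base filter can be defined once, passed around, and extended later:
base = Cmp("age", ">=", 18)
full = base & Cmp("country", "=", "US")
sql, params = full.compile()
```

Because the filter is data, not text, compilation to SQL happens in one place and parameters are threaded through automatically.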


I'm fine with "thin" ORMs that map rows to objects, and not much more. You can just write SQL (or use a view) while avoiding some repetitive code.

Funnily enough, I had gone from ORM to SQL a couple of years ago, and now I've just recently gone back to an ORM. For simple projects, raw SQL can be quite good. But once your model is complex enough that you're doing 3+ way joins relatively often, mapping the results by hand is very time consuming, and building an abstraction is really building the beginnings of your own crappy ORM.
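A "thin ORM" in the sense described above can be surprisingly small. A sketch in Python (the table, class, and helper names are all hypothetical):

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class User:          # plain data class, no base class, no magic
    id: int
    name: str

def query(conn, cls, sql, params=()):
    """A 'thin ORM' in one function: run whatever SQL you like (or select
    from a view), get typed objects back. Field order must match the
    SELECT list."""
    return [cls(*row) for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

users = query(conn, User, "SELECT id, name FROM users WHERE id = ?", (1,))
```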

ORMs do so much more than mapping. How about defining views and using those with your ORM?

Not a bad point. Then again, once you hit your first 1200-line stored procedure, full of logical bugs including SQL injections, you'll realize SQL isn't that amazing either :)

If your stored procedure has 1200 lines, you're doing something wrong.

How about recreating a report to feed into a legacy system from an ERP installation with 5,000 tables and 12,000 stored procedures - none of which you're allowed to modify or face losing support?

I also try my best to avoid SQL procedures as much as possible (except for database migrations; no avoiding them there). The 1200 line procedure should definitely be written in a real programming language, but what's done is done, I guess haha.

1. This is not an unpopular opinion on HN; quite the opposite, I see comments like this every day.

2. Don't attribute a failure of a programmer to the ORM. Tech like ORMs gives more people access to the underlying technology, yes, so there will be more errors like N+1. But that's not the purpose of the ORM, just a side effect. The purpose is to speed up development for those who have already learned SQL. You can't blame an N+1 error on the ORM unless it somehow prohibits you from addressing such a problem, which no modern ORM does.

> Don't attribute a failure of a programmer to the ORM

While I sympathize with this, I would like ORMs to make it a lot harder to make these mistakes.

Especially frameworks for dynamic languages (in my experience) have performance issues by default (and when following their docs). That should probably change.


The Jevons paradox applies to computers [1]. Additional computational power is mostly wasted. That's why I use Xfce. The additional bloat doesn't accomplish anything in my view. (And I might switch to something lighter in the future.)

[1] https://en.wikipedia.org/wiki/Jevons_paradox


Regarding Xfce and lightweight desktops, no desktop environment, a few services, X, a tiling window manager (e.g. stumpwm or i3) and a terminal can easily fit into less than 256 MB of RAM. My Arch and Nix setups are in fact close to 128.

When casual users insist on needing 16 GB of RAM to surf the web, watch videos and chat, it makes me cringe.


Regarding RAM consumption. I'm on a Windows box right now. The Windows Desktop Manager takes ~75MB RAM, but my Firefox process (with most tabs suspended) takes >2GB RAM. I run into the same on my Linux and BSD machines at home.

I run i3 and the XFCE settings daemon for convenience. Sure, I could optimize more, but it's really not the area where I'll see a lot of benefit. My best optimization for standard users is an automatic tab suspender.


Why is this general rant on the state of software development relevant to a Gitlab outage?

The constant issues GitLab has had with updates/deployments echo a common feeling many people have about software in general: that we have needlessly over-complicated things, making them brittle.

Whether it's GitLab's stack that is brittle, or that the team doesn't have experience dealing with growing pains, it doesn't change the point.


The parent commenter is saying (or rather, insinuating) that GitLab's outage is symptomatic of a broader problem in the software industry, vis-à-vis unnecessary complexity.

I think it was worded a bit harshly, but there's a kernel of truth there. I don't mean to pile onto GitLab in particular, but take a look at this open proposal for hardware upgrades at GitLab in late 2016: https://about.gitlab.com/2016/12/11/proposed-server-purchase.... I don't think I can do a better job of explaining what's wrong with that proposal than this thread, from the ensuing HN discussion: https://news.ycombinator.com/item?id=13153860.

Our industry is plagued with architecture decisions that pile layers and layers of complexity onto otherwise straightforward technical problems in pursuit of scalable abstraction. Even companies that legitimately have to think about "scale" choose architectures that are technically impressive but brittle in practice.


Parent has never shipped an N + 1 from their armchair. It's amazing, really.

This issue is different, but gitlab.com usually has poor availability, which I'm sure is caused in part by the technical decisions they've made.

Is it? There are plenty of Rails sites, Basecamp included, that have pretty high uptime. GitHub is also based on Rails, and while it could be better, it is substantially more stable than GitLab IMO.

SQL optimization is a dying art. Amazes me when I dig through software we've purchased to find abominations that I have to refactor. I've sent emails to developers with the fixes for shit we've paid for.

Ah yes I love receiving Smug Reports[0] in my inbox. A 30,000' fix provided by someone with a 10' view of the code base and no business understanding outside their own organization.

[0] https://blog.codinghorror.com/new-programming-jargon/ Number 4


That seems like a pretty dismissive take. I've been on the other side of some truly ignorant bug reports, but I've also sent reports with code that, to the best of my knowledge, resolves the problem. If I received a bug report wherein someone was clearly annoyed with software they paid me for but still put the effort into trying to understand and resolve on their own, I'd probably prioritize that email and try to be empathetic about it.

Ah yes, I love receiving Smug Responses from developers who think they know what's best for their spaghetti code. I've been working with SQL longer than some of these kids have been alive. I know my shit, and I know when a query pings my slow query logs because of inefficient nonsense. Generally your end users have put it through far more paces than you, who probably don't use it in any production capacity. There's ZERO excuse for poor indexes, or NONE at all! Do what you want with the provided fix, because I've already applied it to our systems to keep the server from crumbling.

Could it be that you have the 30,000' view, while they had a ground view and climbed up a mountain to fix the view for all those in their town?

Out of frustration I've broken out IDA Pro enough times to scratch my own itch that it's turned me into a near Stallman-esque fan of open source software. When you work for small companies, bug reports on proprietary software issue trackers often get neglected.


This.

I fear that the use of ORMs is slowing the web, and that developers' SQL knowledge is atrophying along with it. I miss the days of crushing a query from 0.4s to 0.0014s. I've seen some pretty ludicrous SQL execution plans generated by ORMs.


No joke, finding that eureka moment of cutting query times down a few orders of magnitude is one of my great pleasures.

Better, automated tools for optimization are needed. The Postgres ecosystem right now lacks a lot of things that enterprise systems have had for decades.

This complaint of software eating up all the gains made by hardware has been noted since the 80s: https://en.wikipedia.org/wiki/Wirth%27s_law

Which is to say, there's not going to be any sort of industry-wide "victory" here. Still, ranting is fun. My favorite recent-ish writing, "Programming as if the Domain (and Performance) Mattered", is linked from https://twitter.com/CarloPescio/status/903603552890834947

You can learn new things by keeping up with rants too. The best algorithm depends not only on the broad architecture but even the microarchitecture. Newer generations of Intel CPUs for instance throw many older optimizations out the window, but you'll still get the occasional suggestions for those optimizations from people who haven't kept up to date.

I'm not sure Ruby counts as a dynamic language du jour given how old it is, but I wish more dynamic languages followed the Lisp tradition instead of the interpreted Basic tradition, especially with capabilities that let you optimize dynamically as needed.

    * (defun add-opt (x y)
        (declare (optimize (speed 3) (space 3) (safety 0) (debug 0)) ; not all of these matter here, just showing options
                 (type fixnum x y))
        (the fixnum (+ x y)))

    ADD-OPT
    * (disassemble 'add-opt)

    ; disassembly for ADD-OPT
    ; Size: 9 bytes
    ; 02D0C2FF:       4801FA           ADD RDX, RDI               ; no-arg-parsing entry point
    ;      302:       488BE5           MOV RSP, RBP
    ;      305:       F8               CLC
    ;      306:       5D               POP RBP
    ;      307:       C3               RET

> I'm not sure Ruby counts as a dynamic language du jour given how old it is

I believe it still did when Gitlab started


All this complexity is a sure sign that our trade is growing up. We've been around for long enough that the world depends on multigenerational legacy systems we've created.

The economics of developing and maintaining all kinds of systems push us in the direction of easy-to-inspect and easy-to-develop code rather than highly optimized code. We use existing systems (like SQL, like UNIX derivatives, like various internet protocols, like git, like Ruby, Javascript and the rest) because they're almost completely debugged, even if they're sometimes clunky.

Even if it's hard to maintain our mission-critical legacy systems, it's not as hard as maintaining the signaling in the New York City subways. Yet.

And nobody HAS to use Gnome or the wizziest GUI sugar on other systems.

Yes, it's discouraging when things don't work smoothly. But our job, as hackers, is to do our best to keep things running.


Interestingly, git itself is a counterexample to your very point. It's written in reasonably tight code and without major runtime dependencies to run fast on a single system with a simple filesystem backing. And depending on who you ask and what their tastes are, it does maybe 80% of the job that all these giant web sites are being asked to do internally, without any rigging or support beyond a remote ssh account.

So "we" haven't collectively failed, just some of us.


I spend most of my engineering time on the front-end, and I 100% agree. We are wasting so many resources that I'm genuinely disgusted by it.

I don't think it's a matter of not understanding how SQL selects work etc., it's more that ORMs don't make it easy to find all the places in our code which might need this relationship prefetched. The fact that it's often a simple fix once you've found a place like that is something to celebrate, honestly.

I don't think ORMs are really to blame. Can you imagine all of the SQL-injected, barely functioning SELECTs/JOINs the internet would be dealing with right now? The real problem: lazy developers are going to be lazy!

I consider ORMs to be a form of technical debt -- they're a great way to get to market quickly. Unfortunately like most technical debt, it usually just gets ignored until it's completely unsustainable.

Unfortunately the smartest minds in CS are exactly suggesting “don’t reinvent the wheel”. Maybe your comment is quite relevant for GitLab, given their admirable openness about their outages and code base, but I’m quite sure that even your TV has the same weird layers upon layers that exist for the sake of “ease of development”.

I’m always surprised when I see complicated software that works, because I know that lots of projects today are duct-taped patchwork.


> Unfortunately the smartest minds in CS are exactly suggesting “don’t reinvent the wheel”

Smartest minds never said that. Here is the correct quote: "Don't reinvent the wheel, but please learn how a wheel works".


Agree completely here. 20 years ago I was involved in a project that rolled out an integrated SAP/GIS system. The GIS server had 3 GB of RAM and was able to support 100 users.

These days we can't even roll out a client computer with that little RAM -- and the end users can't get any more work done than they could 20 years ago.


> [4] I'm honestly not sure that Windows 10 boots faster from an SSD than XP from an HDD.

Win 10 LTSB on an HDD would leave XP on an SSD dead in the water, I believe. The sheer amount of crap that Microsoft has fed into the consumer and even the professional Windows editions, though, is unbelievable.


Yes and no.

When starting a new project it absolutely makes sense to use an ORM and a dynamic language; the cost of "wasted electricity" is negligible next to the development cost saved.



> This will be an unpopular opinion here, but I feel we've (collectively) failed.

We? NO. But those who refused to learn SQL? Yes. They've egregiously failed.


Well-implemented internal GraphQL APIs, and intelligent caching on the client, are a step in the right direction, separating query optimization from the data needed by the view layer. (Though you could argue that this is wasting cycles to have that level of abstraction, but abstraction can be useful in its own right.) This is still ongoing work; for instance, see https://github.com/tfoxy/graphene-django-optimizer which is in progress as of two days ago, and attempts to do away with the N+1 problem systematically.

But more generally, if you're worried about this, choose to work on projects whose impact on the world (however you want to define it) is greater than the electricity your servers use. And if you're working in that direction, your speed to market is much more important than your electricity cost, or your support of legacy hardware. Design abstractions (ORMs, APIs) so that they can be optimized in the future, imagine that they'll be rewritten at that point in Go or Rust by dedicated teams, but in the meantime, stay laser-focused on your vision. This applies whether you're at the smallish vNext team at Microsoft, or a startup nobody's heard about. It's "performance debt," and it's often the most efficient way to capitalize your initiative.


I gotta say I love, but am surprised by, the public Google Doc of the outages. If I worked there I would personally advocate against doing that, only because I feel like it would slow down resolution if I had to constantly worry about not putting "secret" things into the doc. Unless they have someone whose job is to take their real chat and transcribe it, in which case that's great, because that person is useful for a postmortem as well.

Speaking of postmortems, reading through the Google Doc of their outage timeline, my first postmortem question would be why is the website dependent on the database being up? Shouldn't there at least be a fallback that says "We're offline right now sorry" when the database is unavailable? Or in the name of extreme transparency, have it frame or show a link to the google doc of the outage?

My second question would be how often do we test failures of a read slave, of the master, and of pgbouncer? Do we at least have a practiced procedure for manual failover if not an automatic failover?

My third question would be what sort of alerts do we have, if any, on replication lag?

It's been a while since I've had to run a major outage, this was an interesting exercise for me!


> why is the website dependent on the database being up? Shouldn't there at least be a fallback that says "We're offline right now sorry" when the database is unavailable?

According to the outage Google Doc it's supposed to do this but it stopped working a little while ago:

16h33 | Yorick: Why have the 502s on GitLab.com switched to 500s?

16h46 | Slight concern that 500s have not yet gone away, but let’s focus on standing up postgres-04 first...


Right exactly, my first question would be why did this happen, because it could lead to some interesting insight about an unknown dependency.

> Speaking of postmortems, reading through the Google Doc of their outage timeline, my first postmortem question would be why is the website dependent on the database being up?

You can't do anything with GitLab without the database being available.

> Shouldn't there at least be a fallback that says "We're offline right now sorry" when the database is unavailable? Or in the name of extreme transparency, have it frame or show a link to the google doc of the outage?

We have a deploy page, but it requires us to explicitly enable it, which can be easy to forget. I do think adding a link to our Twitter status to the error pages could be useful, or maybe a widget of some kind.

> My second question would be how often do we test failures of a read slave, of the master, and of pgbouncer?

In the past we tested certain scenarios whenever necessary. For example, when introducing new parts of the database load balancer we'd test what happens if we terminate X or Y. However, we do not yet do periodic chaos monkey style testing, which has been on our todo list for quite a while.

> Do we at least have a practiced procedure for manual failover if not an automatic failover?

Failovers are actively worked on (https://gitlab.com/gitlab-com/database/issues/91, https://gitlab.com/gitlab-com/database/issues/86, https://gitlab.com/gitlab-com/database/issues/40). Right now our failover procedure is still quite painful, and it's easy to make mistakes. We held off using this approach for our maintenance because of these reasons, and also because a failover usually results in 2-3 hours of slow loading times. Ironically we ended up with just that.

> My third question would be what sort of alerts do we have, if any, on replication lag?

Plenty: https://gitlab.com/gitlab-com/runbooks/blob/810824765ec45534...

In this particular instance they go to a Slack channel called "#database", and this channel has been on fire with alerts since the problem started.
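The runbooks linked above are Prometheus-based; a rule of roughly this shape is what ends up paging a channel like #database. This is a hypothetical sketch — the metric name, threshold, and labels are made up, and the real runbook rules differ:

```yaml
groups:
  - name: postgres-replication
    rules:
      - alert: PostgresReplicationLagHigh
        # Hypothetical metric; postgres_exporter exposes similar lag gauges.
        expr: pg_replication_lag_seconds > 120
        for: 5m
        labels:
          severity: warn
          channel: database
        annotations:
          description: "Replica {{ $labels.instance }} is {{ $value }}s behind the primary."
```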


Wow sweet, I didn't actually expect a response! Nice work.

> We have a deploy page, but it requires us to explicitly enable it, which can be easy to forget. I do think adding a link to our Twitter status to the error pages could be useful, or maybe a widget of some kind.

Or make it automatic when the database isn't available?

> In this particular instance they go to a Slack channel called "#database", and this channel has been on fire with alerts since the problem started.

Good to know. My next question would be "were there any signs of impending doom before the event happened that could have warned us, and do we have alerts on that?"


> Or make it automatic when the database isn't available?

I agree this is probably a good idea, so I created https://gitlab.com/gitlab-com/infrastructure/issues/4382 to keep track of things.
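For what it's worth, one common way to make such a page automatic is to have the front-end proxy intercept 5xx responses from the app tier and serve a static maintenance page that doesn't depend on the database. A hypothetical nginx sketch (the upstream name and paths are made up, not GitLab's actual config):

```nginx
location / {
    proxy_pass http://gitlab_app;
    proxy_intercept_errors on;
    error_page 502 503 504 = @maintenance;
}

location @maintenance {
    root /var/www/static;
    try_files /maintenance.html =503;
}
```

Intercepting 500s as well would cover the database-down case, at the cost of masking genuine application errors, so a dedicated health check on the app is often a better trigger.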

> were there any signs of impending doom before the event happened that could have warned us, and do we have alerts on that?

I don't think you can detect corrupt WAL until it is actually corrupt. In that case the host won't start up, and we already have various alerts that cover it (e.g. an alert for when too few transactions are happening).

There also wasn't really any impending doom before we started things. Two issues that triggered this were:

1. A lack of `set -e`

2. Apparently when we stop a command using `gitlab-ctl`, it will time out if the process doesn't stop fast enough. I'm not sure what exit status we report in that case, but in this instance it resulted in the script continuing to run.
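The failure mode in (1) is easy to reproduce; a minimal sketch (hypothetical, not GitLab's actual script):

```shell
#!/usr/bin/env bash
# Without `set -e`, a failed command is silently ignored and the script
# keeps running its later (possibly destructive) steps.

run_without_set_e() {
  bash -c '
    false                 # stands in for a "stop" command that failed
    echo "kept going"     # ...but the script carries on anyway
  '
}

run_with_set_e() {
  bash -c '
    set -e                # abort as soon as any command fails
    false
    echo "kept going"     # never reached
  '
}

run_without_set_e           # prints: kept going
run_with_set_e || true      # prints nothing; inner script exited at `false`
```

In practice the usual defensive header is `set -euo pipefail`, since `set -e` alone does not catch failures in the middle of a pipeline.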


You should probably start using zalando/patroni on a k8s cluster, scale the db in k8s, and maybe put a LB outside of k8s with an ingress/LB. It's way simpler than repmgr, etc.

I'm happy to be corrected here, but in response to:

> I have to constantly be worrying about not putting "secret" things into the doc. Unless they have someone who's job is to take their real chat and transcribe it, in which case, that's great because that person is useful for a postmortem as well.

I think they did only have one person updating the document; from what I could see while following the live updates, only very occasionally did someone other than 'Andrew' update it.


> why is the website dependent on the database being up?

I believe they use "GitLab Pages" to host their main, public-facing website. If Gitlab.com (the app I mean) is down, then so is their website.


I have growing concerns over GitLab. I appreciate the openness and forthrightness of their handling of these sorts of incidents and in regard to other issues, but there comes a point where none of that matters if the product itself is inherently unreliable. This post isn't about any one incident, including this one, but an ongoing trend that I don't get the sense is improving.

To be completely fair, I'm on the free plan and have no rightful expectation of any particular service level. Having said that, I have considered moving to paid tiers, and I do advise others on these sorts of services (whether cloud-based or on-premise). Every time I see a planned maintenance of "under 1m", I simply realize that I shouldn't plan anything important around GitLab that day. I can't see that this would be any different on GitLab.com paid plans, and I have to imagine there is something inherently difficult in managing the software if its own developer has trouble with routine maintenance; this colors my impression of what it might be like in-house. It seems a beast. I get that the scaling issues differ between something like GitLab.com and GitLab on-premise, but the track record here shapes my expectations.

At some point moving fast and breaking things needs to give way to the pursuit of enough stability that users aren't overly concerned about whether or not the tools they depend on, or more importantly the data they're storing, are regularly at risk.

I don't want to buy services/products from a company that is just trustworthy and open when things go wrong. I want to buy services/products from companies that only need to prove those qualities with some rarity.


It always seems to be either their storage abstraction (Ceph?) or Postgres. I've never used Postgres but from what I see it's not really designed for enormous scale (it's a very old project). Perhaps they would be well served to see if CockroachDB gives them more stability. I've started using it at a small scale and the clustering aspect seems legit.

I work with PostgreSQL quite a lot and I don't think this is the case.

Yes, PostgreSQL is a mature project. But that has little or no bearing by itself on what degree it can scale; Linux is just about as old, yet it still drives most of the infrastructure we're talking about. In context, PostgreSQL has been under continuous development since inception and no more so than in the past decade or so with substantial community and corporate backing. To conflate project age with current robustness is to indulge in fallacious thinking (in this case "cum hoc ergo propter hoc", if I'm not mistaken). The project has advanced with the times. There are good examples of PostgreSQL running at scale, including at companies operating at scale such as Instagram, Skype (up to at least the point of the Microsoft acquisition), Pandora, as well as others. Most of these, I'd bet, are/were using PostgreSQL at scales substantially larger than that faced by GitLab.

Relational databases require good management and good design to function properly. There were people who specialized in this, called Database Administrators (at least for the management piece anyway). My feeling (perhaps unjustified) is that in start-upish environments there is a tendency to minimize DBA expertise in favor of having more conventional developers who can "get the database to work"... which is a different standard than "getting the database to perform". That's mostly not my world, so I may be jumping to conclusions (I'm an enterprise systems guy on most days).

I think GitLab probably has a complex software product and lacks the correct expertise in infrastructure (including, yes, DBAs) for their SaaS offering. I'm reading tea leaves, but that would be a perfect storm which would produce exactly what we see no matter how robust any one piece of this puzzle might be.


> Yes, PostgreSQL is a mature project. But that has little or no bearing by itself on what degree it can scale; Linux is just about as old, yet it still drives most of the infrastructure we're talking about.

I knew someone here would misinterpret why I mentioned its age. As I said, it's not what it was designed for. Perhaps it has bolted on things in the 20+ years since it was first conceived, but it wasn't purpose-built for scale.

> To conflate project age with current robustness is to indulge in fallacious thinking

It only makes sense to talk about "robustness" in the context of what something was designed for. I can't abuse a system and then call it not robust. Of course Postgres is robust. You're twisting what I was saying.

> There are good examples of PostgreSQL running at scale, including at companies operating at scale such as Instagram

I'd actually be keen to learn more about how Postgres works at Instagram scale without their own customisations to make it work in that setting.

If Postgres works well in that setting why do things like this exist? https://github.com/citusdata/citus



> Bash scripts did not have set -x so continued to run even after the stop command had failed.

I hope that's a typo in transcription and they actually used `set -e` to fix (and maybe `set -o pipefail`), `set -x` does something entirely different.


Yes that's a typo, it should have been `set -e`

alas poor yorick

I’ve been through similar scenarios, and they went down like this: scrambling, making notes about things that should have been done, while putting full understanding aside to fight the fire. I used to think there was a better way; now I’m convinced there is, but it would take too much time across the stack, and things would still go down anyway, because you can’t control everything all the time forever.

Super transparent at least

Even that took 10s to load for me, haha.

    18h45 
    Yorick is doing the vacuuming. Vacuuming is not doing it’s thing. Why….
    
    18h48 
    Still not vacuuming.
    
    18h50 
    Yorick: still not sure why it’s not vacuuming.    
    We need a roomba.
I love them :)

The hosted product, whilst far more fully featured than github, is horridly unstable.

It's got to the point where I am seriously considering moving our repos (~250) to a self-hosted CE version.

Looking at the outage reports doesn't fill me with confidence: a lack of redundancy, no systematic testing of backups, a seeming lack of QA.

GitLab has an outage on average once every two weeks, compared to GitHub's once every 6 _months_ (look at the status pages).


Please send some love to the engineer responsible. I've been on the causing end of a 500 error; even though I fixed it in 10 minutes, it's never fun!

It's a little bit concerning to note that the last GitLab outage was also caused by Postgres replication issues, which eventually culminated in some pretty bad data loss and a realisation that the most recent backups they had weren't usable. Maybe they should reach out to someone from Postgres to help figure out why their replication appears to have some sort of systemic issue...

I'm also seeing a 5xx Server error on instagram.

Edit: ~15 mins later Instagram seems to be back up.

Edit #2: Seems to be back to 5xx server error ~1hr now.


still seeing it

Curious StackOverflow also has a 500, could they be related?

They do database maintenance at 11am EST on a Tuesday? Seems like a terrible time to do database maintenance?

They probably need stronger arguments than "Micro$oft is really bad" if they want to have an edge over GitHub. The solidity of a professional tool is way more important than a company's process transparency.

So now that there's no outage, this post has no significance and no way to see what was happening during the outage. Does anyone have a cached version of the page or a screenshot for some context?

Yes, I confirm the same. They are both returning server 500 errors. That is quite a coincidence.

Had quite some trouble pushing this afternoon. Hope it gets resolved soon.

stackoverflow is up now. gitlab still down.

Time for another outage tshirt?

GitLab with problems? Again?

WOW back at the right moment.

stackoverflow was actually returning a 503, not a 500.

Should have used AWS Aurora Postgres


Excellent, my fork-me-on-gitlab project is viewable again.

https://gitlab.com/seanwasere/fork-me-on-gitlab



