This will be an unpopular opinion here, but I feel we've (collectively) failed. Between ORMs and the dynamic language du jour, we have become so far removed from how computers work that most software development is a waste of electricity. We're using tens of servers and complex architectures for simple sites, just because we think that's the only way to scale them [1]. And our computers are slower than they ever were [2][3].
I only gave a couple of examples, but look around, you can find them everywhere. Everything is awful.
[1] The issue tracker is now down, but you can try it when it comes back: https://about.gitlab.com/releases/.
[2] On older computers GNOME can't even keep up with updating the mouse cursor.
[3] I'm honestly not sure that Windows 10 boots faster from an SSD than XP from an HDD.
The mantra I keep repeating to myself is: everything can and will fail, and you're adding layers of things that will fail, so your pipeline just becomes extremely brittle. And while there's nothing more frustrating than a procedural dump of linear code, at least debugging it is possible.
With these overly complex systems, your layers of abstraction start fighting against you, and often they'll be on different servers, etc. All of this is hard, and if it's not necessary, don't jump there.
Chances are, all you need is to properly use a relational database or another similarly proven, highly flexible and optimized tool, and you'll be fine until you have the time and money to solve the scale issues.
Technical debt is always a future manager's problem, so two things follow: it's never worked on, and preventing it is never seen as a big accomplishment when evaluating an engineer.
Secondly, managers are people, and people like flashy things. When someone says they built the website in vanilla JS because it works and we won't need anything more complicated for years, the manager isn't going to get that excited or remember it when raise time comes around. If the engineer says they rewrote an application from [legacy language] to some hip new language that the manager has heard about from their peers in other companies who are also rewriting their applications, it's going to stick in the manager's head when they're rating their employees.
You get what you pay for, and modern companies are paying more for semi-working but flashy software than they pay for solid but boring software.
But honestly, I agree with that approach to a certain (!) extent. I don't want to write all of my backend in C++. I use Java or Python instead. A thousand reasons: Maven/pip, portability, etc.
So, I would say we need to come to terms with ourselves and agree that productivity is sometimes increased at the expense of system quality. After that, I think it's okay to keep doing it as long as both we and our customers are OK with the tradeoff.
Just to illustrate: do I like the RAM consumption & the startup time of my Slack app? No. But I am a free user, so I can't complain. In the end, the Slack team probably made the right choice.
UPD: the 500 error is gone. GitLab also needs to manage costs and be friendly to OSS contributors. Ruby was also probably a good choice for them. If you want top-notch software, be prepared to stomach NASA-grade costs: the rumour is that one line of code in the Shuttle software ended up costing $1500.
P.S. Thanks a lot for the paper reference!
The difficulty of C++ is way overrated. Most mediocre projects can be written with the basics of C++ (which anyone can learn in less than three months).
We could be living in a far more performant world if people would stop horrifying beginners with the difficulties of C++/Rust/...
Go talk to a beginner: the moment you say anything about C/C++, the first thing they mention is, what do you do about buffer overflows? They don't even know what one is, but they're worried about it.
Is this really a good excuse with containers being used so frequently?
Portability – not so much, but it does save you a lot of headache, because switching from Ubuntu 16.04 to 18.04 should not break anything in your system.
Should have used K&R braces
I just wanted to say that I was kinda relieved reading that, because I often feel like I’m the only person who thinks that.
I told them I had plenty of experience going down this route (my most popular Stack Overflow question is on UoW/repository) and to just use Dapper and not look back. There is a pleasant simplicity in using plain old C# classes and simple queries that you can visualize just by looking at them, without having to run a trace to figure out what's actually being executed.
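Dapper itself is .NET, but the same "plain objects plus visible SQL" idea translates to, say, Python in a few lines (a sketch; the `users` table and fields are made up):

    import sqlite3
    from dataclasses import dataclass

    @dataclass
    class User:  # a plain old class: no ORM base class, no lazy proxies
        id: int
        name: str
        email: str

    def users_by_domain(conn: sqlite3.Connection, domain: str) -> list[User]:
        # The query you read is the query that runs; no trace needed.
        rows = conn.execute(
            "SELECT id, name, email FROM users WHERE email LIKE ?",
            ("%@" + domain,),
        )
        return [User(*row) for row in rows]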
This was one battle of many. I lost that battle.
> This was one battle of many. I lost that battle.
Have a hug :-).
Either way, ORM or SQL, you have to understand the underlying database technology.
The problem is lack of skills not tech.
It may be a tradeoff between lower standards (e.g. "Can you use an ORM?") and a steeper learning curve before someone can produce useful work. The former optimizes for more bodies that meet a certain bar; the latter may optimize for a higher skill level on average.
There are many other factors that contribute, but it's interesting to think about. (I know, for example, the Navy SEALs lowered their standards a ways back to increase enrollment... and some friends have said it's had a negative impact on the teams, because they have taken some people with behavioral profiles that would never have made it in the past. But they had more missions due to changing nature of warfare, and needed the numbers.)
I also find ORMs are great for transactional updates where you're only modifying a few entities at a time. They really reduce the amount of boilerplate required (reducing errors) and let you use refactoring tools (find usages, etc.) to locate where a particular column is used in the code base.
So to me it's not about making SQL easier, because let's face it, you have a complex translation layer there that requires full understanding. Rather, it's about capturing those clear benefits.
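For example, the kind of small transactional update where the ORM earns its keep; a sketch using SQLAlchemy (assuming a configured `engine` and a mapped `User` class, both hypothetical names):

    from sqlalchemy.orm import Session

    with Session(engine) as session:
        with session.begin():
            user = session.get(User, 42)    # load the entity by primary key
            user.email = "new@example.com"  # plain attribute assignment
            # No hand-written UPDATE: the unit of work emits
            #   UPDATE users SET email = ? WHERE id = ?
            # on commit, and "find usages" on User.email finds this spot.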
The difficulty required to think about a solution increases significantly with additional components.
Honestly, if you're developing a web app, once you have it working, part of the routine should be to look at the SQL queries for each operation and spot any N+1 issues. I use Django, so this is easy to do in the Django Debug Toolbar. I'm sure similar tooling exists in other frameworks as well.
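For instance, the classic N+1 and its fix in Django (a sketch, assuming hypothetical `Book`/`Author` models):

    # N+1: one query for the books, then one more query per book.
    for book in Book.objects.all():
        print(book.author.name)

    # One query with a JOIN; the Debug Toolbar query count drops accordingly.
    for book in Book.objects.select_related("author"):
        print(book.author.name)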
Surely you're using prepared statements?!
I'll only do this for internal apps, not for anything on the public internet. This is not a good practice.
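(For reference, a parameterized query with Python's DB-API, so the value is bound by the driver rather than spliced into the SQL string; the `users` table is made up:)

    import sqlite3

    conn = sqlite3.connect("app.db")
    name = "alice'; DROP TABLE users; --"  # hostile input, harmless below

    # Bad: string interpolation splices user input into the SQL text.
    #   conn.execute(f"SELECT id FROM users WHERE name = '{name}'")

    # Good: the placeholder is bound by the driver, never interpolated.
    rows = conn.execute("SELECT id FROM users WHERE name = ?", (name,))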
That said, ORMs are far from equal. Some are much better and closer to SQL than others. I've been impressed with ActiveRecord in that regard, and also with Ecto.
The result? My query-generating classes would happily take in individual parts of an SQL statement, or a whole query, handle it through the same database abstractions as a generated one, and give me back workable objects as best as possible.
This came as a result of seeing what miserable performance I got handling joins as application logic, recursing through things one at a time when the correct query would bring me back every piece of information I needed at once.
I still use some of that old code from time to time, mostly because I miss how well it meshed with the way I think about things. Sometimes writing your own, if you've got the time, is a nice way to become truly productive and comfortable.
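The shape described might look something like this (a purely hypothetical sketch with made-up names, not the actual code):

    import sqlite3

    class Query:
        """Accept either individual parts or a whole SQL statement,
        and always hand back the same kind of row objects."""

        def __init__(self, sql=None, table=None, where=None, params=()):
            if sql is None:
                # Assemble from parts (parts are trusted, params are bound).
                sql = "SELECT * FROM " + table
                if where:
                    sql += " WHERE " + where
            self.sql, self.params = sql, params

        def run(self, conn):
            conn.row_factory = sqlite3.Row  # dict-like "workable objects"
            return conn.execute(self.sql, self.params).fetchall()

    # Generated and hand-written queries flow through one abstraction:
    #   Query(table="users", where="age > ?", params=(18,)).run(conn)
    #   Query(sql="SELECT u.id, COUNT(o.id) FROM users u "
    #             "JOIN orders o ON o.user_id = u.id GROUP BY u.id").run(conn)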
Enter SQLAlchemy Expressions.
The reason that's actually an excellent idea is twofold. One is the entity-mapping part (i.e. you can write very complex queries and still get your mapped objects back); the other is that you're essentially writing the AST of your SQL queries using real objects, so it's very easy to manipulate and reuse, which is quite hard to do with textual SQL. (Also, it's easy to mix textual SQL and SQL expressions in SQLAlchemy, but the need to do that is very, very rare indeed.)
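Concretely, a sketch with SQLAlchemy Core (hypothetical table): the statement is built and reused as objects, and only rendered to SQL at the end.

    from sqlalchemy import Column, Integer, MetaData, String, Table, select

    metadata = MetaData()
    users = Table(
        "users", metadata,
        Column("id", Integer, primary_key=True),
        Column("name", String),
        Column("active", Integer),
    )

    # The statement is an object tree (effectively an AST), not a string...
    base = select(users).where(users.c.active == 1)

    # ...so manipulating and reusing it is just ordinary function calls:
    def named_like(stmt, prefix):
        return stmt.where(users.c.name.like(prefix + "%"))

    admins = named_like(base, "admin")
    print(admins)  # renders SQL with bound parameters only when asked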
While I sympathize with this, I would like ORMs to make it a lot harder to make these mistakes.
Especially frameworks for dynamic languages (in my experience) have performance issues by default (and when following their docs). That should probably change.
When casual users insist on needing 16 GB of RAM to surf the web, watch videos and chat, it makes me cringe.
I run i3 and the XFCE settings daemon for convenience. Sure, I could optimize more, but it's really not the area where I'll see a lot of benefit. My best optimization for standard users is an automatic tab suspender.
Whether it's GitLab's stack that is brittle, or the team not having experience dealing with growing pains, it doesn't change the point.
I think it was worded a bit harshly, but there's a kernel of truth there. I don't mean to pile onto GitLab in particular, but take a look at this open proposal for hardware upgrades at GitLab in late 2016: https://about.gitlab.com/2016/12/11/proposed-server-purchase.... I don't think I can do a better job of explaining what's wrong with that proposal than this thread, from the ensuing HN discussion: https://news.ycombinator.com/item?id=13153860.
Our industry is plagued with architecture decisions that pile layers and layers of complexity onto otherwise straightforward technical problems in pursuit of scalable abstraction. Even companies that legitimately have to think about "scale" choose architectures which are technically impressive but brittle in practice.
https://blog.codinghorror.com/new-programming-jargon/ (Number 4)
Out of frustration, I've broken out IDA Pro enough times to scratch my own itch that it's turned me into a near-Stallman-esque fan of open source software. When you work for small companies, bug reports on proprietary software issue trackers often get neglected.
I fear the usage of ORMs is slowing the web, and developers' SQL knowledge is atrophying along with it. I miss the days of crushing a query from 0.4s to 0.0014s. I've seen some pretty ludicrous SQL execution plans generated by ORMs.
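To be fair, most ORMs will at least show you what they're about to do; e.g. in Django (a sketch, hypothetical `Book` model):

    # What will the ORM actually run, and what plan does the DB pick?
    qs = Book.objects.filter(title__icontains="sql")
    print(qs.query)      # the generated SQL
    print(qs.explain())  # the execution plan (QuerySet.explain, Django 2.1+)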
Which is to say there's not going to be any sort of "victory" industry-wide here. Still, ranting is fun. My favorite recent-ish writing is linked from https://twitter.com/CarloPescio/status/903603552890834947: "Programming as if the Domain (and Performance) Mattered".
You can learn new things by keeping up with rants too. The best algorithm depends not only on the broad architecture but even the microarchitecture. Newer generations of Intel CPUs for instance throw many older optimizations out the window, but you'll still get the occasional suggestions for those optimizations from people who haven't kept up to date.
I'm not sure Ruby counts as a dynamic language du jour given how old it is, but I wish more dynamic languages followed the Lisp tradition instead of the interpreted Basic tradition, especially with capabilities that let you optimize dynamically as needed.
    * (defun add-opt (x y)
        (declare (optimize (speed 3) (space 3) (safety 0) (debug 0)) ; not all of these matter in this example, just showing options
                 (type fixnum x y))
        (the fixnum (+ x y)))
    * (disassemble 'add-opt)
    ; disassembly for ADD-OPT
    ; Size: 9 bytes
    ; 02D0C2FF: 4801FA    ADD RDX, RDI    ; no-arg-parsing entry point
    ;      302: 488BE5    MOV RSP, RBP
    ;      305: F8        CLC
    ;      306: 5D        POP RBP
    ;      307: C3        RET
I believe it still did when Gitlab started
Even if it's hard to maintain our mission-critical legacy systems, it's not as hard as maintaining the signaling in the New York City subways. Yet.
And nobody HAS to use Gnome or the whizziest GUI sugar on other systems.
Yes, it's discouraging when things don't work smoothly. But our job, as hackers, is to do our best to keep things running.
So "we" haven't collectively failed, just some of us.
I'm always surprised when I see complicated software that works, because I know that lots of projects today are duct-taped patchwork.
Smartest minds never said that. Here is the correct quote: "Don't reinvent the wheel, but please learn how a wheel works".
These days we can't even roll out a client computer with that little RAM -- and the end users can't get any more work done than they could 20 years ago.
Win 10 LTSB on an HDD would leave XP on an SSD dead in the water, I believe; the sheer amount of crap that has been fed into the consumer, and even the professional, editions of Windows, though, is unbelievable.
When starting a new project it absolutely makes sense to use an ORM and a dynamic language; the cost of "wasted electricity" is negligible next to the development cost saved.
We? NO. But those who refused to learn SQL, yes. They've egregiously failed.
But more generally, if you're worried about this, choose to work on projects whose impact on the world (however you want to define it) is greater than the electricity your servers use. And if you're working in that direction, your speed to market is much more important than your electricity cost, or your support of legacy hardware. Design abstractions (ORMs, APIs) so that they can be optimized in the future, imagine that they'll be rewritten at that point in Go or Rust by dedicated teams, but in the meantime, stay laser-focused on your vision. This applies whether you're at the smallish vNext team at Microsoft, or a startup nobody's heard about. It's "performance debt," and it's often the most efficient way to capitalize your initiative.
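A minimal sketch of that idea in Python (all names hypothetical): keep the interface narrow so the quick ORM-backed version can later be swapped for a hand-tuned one without touching callers.

    from typing import Optional, Protocol

    class UserStore(Protocol):
        """A narrow seam: callers depend only on this interface."""
        def find_by_email(self, email: str) -> Optional[dict]: ...

    class OrmUserStore:
        """Today: whatever the ORM gives you, quickly."""
        def __init__(self, session):
            self.session = session
        def find_by_email(self, email):
            return self.session.get_user_by_email(email)  # hypothetical ORM call

    class TunedUserStore:
        """Later: hand-tuned SQL (or a call out to a Go/Rust service)
        behind the same seam; no caller changes."""
        def __init__(self, conn):
            self.conn = conn
        def find_by_email(self, email):
            row = self.conn.execute(
                "SELECT id, name, email FROM users WHERE email = ?", (email,)
            ).fetchone()
            return dict(zip(("id", "name", "email"), row)) if row else None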
Speaking of postmortems, reading through the Google Doc of their outage timeline, my first postmortem question would be why is the website dependent on the database being up? Shouldn't there at least be a fallback that says "We're offline right now sorry" when the database is unavailable? Or in the name of extreme transparency, have it frame or show a link to the google doc of the outage?
My second question would be how often do we test failures of a read slave, of the master, and of pgbouncer? Do we at least have a practiced procedure for manual failover if not an automatic failover?
My third question would be what sort of alerts do we have, if any, on replication lag?
It's been a while since I've had to run a major outage, this was an interesting exercise for me!
According to the outage Google Doc it's supposed to do this but it stopped working a little while ago:
16h33 | Yorick: Why have the 502s on GitLab.com switched to 500s?
16h46 | Slight concern that 500s have not yet gone away, but let’s focus on standing up postgres-04 first...
You can't do anything with GitLab without the database being available.
> Shouldn't there at least be a fallback that says "We're offline right now sorry" when the database is unavailable? Or in the name of extreme transparency, have it frame or show a link to the google doc of the outage?
We have a deploy page, but it requires us to explicitly enable it, which can be easy to forget. I do think adding a link to our Twitter status to the error pages could be useful, or maybe a widget of some kind.
> My second question would be how often do we test failures of a read slave, of the master, and of pgbouncer?
In the past we tested certain scenarios whenever necessary. For example, when introducing new parts of the database load balancer we'd test what happens if we terminate X or Y. However, we do not yet do periodic chaos monkey style testing, which has been on our todo list for quite a while.
> Do we at least have a practiced procedure for manual failover if not an automatic failover?
Failovers are actively worked on (https://gitlab.com/gitlab-com/database/issues/91, https://gitlab.com/gitlab-com/database/issues/86, https://gitlab.com/gitlab-com/database/issues/40). Right now our failover procedure is still quite painful, and it's easy to make mistakes. We held off using this approach for our maintenance because of these reasons, and also because a failover usually results in 2-3 hours of slow loading times. Ironically we ended up with just that.
> My third question would be what sort of alerts do we have, if any, on replication lag?
In this particular instance they go to a Slack channel called "#database", and this channel has been on fire with alerts since the problem started.
> We have a deploy page, but it requires us to explicitly enable it, which can be easy to forget. I do think adding a link to our Twitter status to the error pages could be useful, or maybe a widget of some kind.
Or make it automatic when the database isn't available?
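GitLab is Rails, not Python, but the idea fits in a few lines of (hypothetical) WSGI-style middleware: catch the database-down error at the edge and serve a static page.

    import sys

    class DatabaseDown(Exception):
        """Stand-in for whatever connection error your driver raises."""

    class MaintenancePage:
        """Hypothetical middleware: if a request fails because the database
        is unreachable, serve a static page instead of a bare 500."""

        def __init__(self, app, page=b"<h1>We're offline right now, sorry.</h1>"):
            self.app, self.page = app, page

        def __call__(self, environ, start_response):
            try:
                return self.app(environ, start_response)
            except DatabaseDown:
                start_response("503 Service Unavailable",
                               [("Content-Type", "text/html")],
                               sys.exc_info())
                return [self.page]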
> In this particular instance they go to a Slack channel called "#database", and this channel has been on fire with alerts since the problem started.
Good to know. My next question would be "were there any signs of impending doom before the event happened that could have warned us, and do we have alerts on that?"
I agree this is probably a good idea, so I created https://gitlab.com/gitlab-com/infrastructure/issues/4382 to keep track of things.
> were there any signs of impending doom before the event happened that could have warned us, and do we have alerts on that?
I don't think you can detect corrupt WAL until it is actually corrupt. In that case the host won't start up, and we already have various alerts that cover that case (e.g. there's an alert for when too few transactions are happening).
There also wasn't really any impending doom before we started things. Two issues that triggered this were:
1. A lack of `set -e`
2. Apparently when we stop a command using `gitlab-ctl`, it will time out if the process doesn't stop fast enough. I'm not sure what exit status we report in that case, but in this instance it resulted in the script continuing to run.
I have to constantly worry about not putting "secret" things into the doc. Unless they have someone whose job is to take their real chat and transcribe it, in which case that's great, because that person is useful for the postmortem as well.
I think they did have only one person updating the document, from what I could see as I was following the live updates; only very occasionally did I see someone other than 'Andrew' update it.
I believe they use "GitLab Pages" to host their main, public-facing website. If Gitlab.com (the app I mean) is down, then so is their website.
To be completely fair, I'm on the free plan and have no rightful expectation of any sort of performance level. Having said that, I have considered moving to paid tiers, and I do advise others in regard to these sorts of services (whether cloud-based or on-premise). Every time I see a planned maintenance of "under 1m", I simply realize that I shouldn't plan anything important around GitLab that day. I can't see that this would be any different with GitLab.com paid plans, and I have to imagine there is something inherently difficult in managing the software if its own developer has issues with common maintenance; this colors my impression of what it might be like in-house. It seems a beast. I get that there are scaling issues that differ between something like GitLab.com and GitLab on-premise, but the history of these things colors my impression all the same.
At some point moving fast and breaking things needs to give way to the pursuit of enough stability that users aren't overly concerned about whether or not the tools they depend on, or more importantly the data they're storing, are regularly at risk.
I don't want to buy services/products from a company that is merely trustworthy and open when things go wrong. I want to buy services/products from companies that only rarely need to prove those qualities.
Yes, PostgreSQL is a mature project. But that by itself has little or no bearing on the degree to which it can scale; Linux is just about as old, yet it still drives most of the infrastructure we're talking about. In context, PostgreSQL has been under continuous development since inception, and never more so than in the past decade or so, with substantial community and corporate backing. To conflate project age with current robustness is to indulge in fallacious thinking (in this case "cum hoc ergo propter hoc", if I'm not mistaken). The project has advanced with the times. There are good examples of PostgreSQL running at scale, including at companies operating at scale such as Instagram, Skype (up to at least the point of the Microsoft acquisition), Pandora, as well as others. Most of these, I'd bet, are/were using PostgreSQL at scales substantially larger than that faced by GitLab.
Relational databases require good management and good design to function properly. There were people who specialized in this called Database Administrators (at least for the management piece, anyway). My feeling (perhaps unjustified) is that in start-upish environments there is a tendency to minimize DBA expertise in favor of having more conventional developers who can "get the database to work"... which is a different standard than "getting the database to perform". That's mostly not my world, so I may be jumping to conclusions (I'm an enterprise systems guy on most days).
I think GitLab probably has a complex software product and lacks the correct expertise in infrastructure (including, yes, DBAs) for their SaaS offering. I'm reading tea leaves, but that would be a perfect storm which would produce exactly what we see no matter how robust any one piece of this puzzle might be.
I knew someone here would misinterpret why I mentioned its age. As I said, it's not what it was designed for. Perhaps it has had things bolted on in the 20+ years since it was first conceived, but it wasn't purpose-built for scale.
> To conflate project age with current robustness is to indulge in fallacious thinking
It only makes sense to talk about "robustness" in the context something was designed to work in. I can't abuse a system and call it unrobust. Of course Postgres is robust. You're twisting what I was saying.
> There are good examples of PostgreSQL running at scale, including at companies operating at scale such as Instagram
I'd actually be keen to learn more about how Postgres works at Instagram scale without their own customisations to make it work in that setting.
If Postgres works well in that setting why do things like this exist? https://github.com/citusdata/citus
I hope that's a typo in transcription and they actually used `set -e` as the fix (and maybe `set -o pipefail`); `set -x` does something entirely different.
Yorick is doing the vacuuming. Vacuuming is not doing its thing. Why….
Still not vacuuming.
Yorick: still not sure why it’s not vacuuming.
We need a roomba.
It's got to the point where I am seriously considering moving our repos (~250) to a self-hosted CE version.
Looking at the outage reports, it doesn't fill me with confidence. A lack of redundancy, no systematic testing of backups, a seeming lack of QA.
GitLab has an outage on average once every two weeks, compared to GitHub's once every 6 _months_ (look at the status pages).
Edit: ~15 mins later, GitLab seems to be back up.
Edit #2: Seems to be back to 5xx server errors, ~1hr now.