
The Consul outage that never happened - dankohn1
https://about.gitlab.com/blog/2019/11/08/the-consul-outage-that-never-happened/
======
zedpm
> There is just no substitute for understanding how everything works.

Great line, and one I strongly agree with. I love the tooling we have
available today like Terraform, Ansible, etc. but the experience I gained by
keeping bare metal servers alive and happy with nothing more than a shell has
undoubtedly made me a much better admin.

~~~
lawnchair
Agree 100%. I'm an SRE team of one and I'm looking for an additional team
member. I've interviewed about a dozen people so far and one thing I've
noticed is that a lot of young engineers do not know the basics. They really
can't tell me how to log into a Linux box and troubleshoot. I think that's a
shame. You really lose a lot of insight if you don't understand how the
underlying pieces work.

~~~
t34543
Even worse: I joined a new company in a senior technical role, and it’s seen
as a negative that I prefer ssh/strace/tcpdump to debug problems.

~~~
ownagefool
It arguably is.

A good SRE needs to understand systems, as in the automation of n computers.
Focusing on and preferring single-system tools, where you have to take manual
action, points to an immaturity in dealing with complex distributed systems.

However, it's a common problem and most of the folks buying complicated
distributed tracing systems don't have particular skills in using them either,
so your skills are valuable, even if there could be better ways to do it.

Similarly, if you focus on hiring SREs who know shell commands well, you might
lose more pertinent skills such as knowing what terraform is actually doing,
general programming skills, CI/CD and an understanding of cloud APIs.

Horses for courses; the more we know the better. Look for both sets of skills
in your teams and cross train as much as possible.

~~~
t34543
Arguably, those who demonstrate an understanding of low-level fundamentals
have the drive to understand as many layers as possible.

Capturing AWS API calls through sslproxy not only implies you know what
terraform is doing, but also that you have a higher probability of solving
difficult problems.

All code boils down to an execution layer and having inspection ability at
that layer will always be valuable.
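
To sketch what I mean (assuming mitmproxy as the intercepting proxy; the
filename and host filter are illustrative, and whether your terraform/AWS SDK
version honors AWS_CA_BUNDLE is worth verifying):

    # log_aws_calls.py -- a mitmproxy addon that prints AWS API calls.
    # Run: mitmdump -s log_aws_calls.py
    # Then point terraform at the proxy and trust mitmproxy's CA, e.g.:
    #   HTTPS_PROXY=http://localhost:8080 \
    #   AWS_CA_BUNDLE=~/.mitmproxy/mitmproxy-ca-cert.pem terraform plan
    from mitmproxy import http

    def request(flow: http.HTTPFlow) -> None:
        host = flow.request.pretty_host
        if host.endswith(".amazonaws.com"):
            # X-Amz-Target carries the operation name for JSON-protocol services.
            target = flow.request.headers.get("X-Amz-Target", "")
            print(flow.request.method, host + flow.request.path, target)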

------
awinder
This was a fascinating read but I feel like I’m left with some more questions
about what’s going on under the hood in gitlab cloud:

1. I thought GitLab was hosted in Google Cloud, but there are a lot of
references to, e.g., a hand-rolled consensus system and self-managed database
clusters. I’m wondering if this event changes the math on build vs. buy at all
for GitLab; it sounds like a lot of money has gone into this solution. How did
that solution come about? Is it about specific Postgres features that aren’t
available in Google’s hosted DBs, or pricing?

2. Again on the Google Cloud angle, why are servers being hand-managed and
rebooted? Elasticity in the cloud would make me think that the safest option
would be to stand up parallel infrastructure (like in a DR plan) and migrate
traffic. Was this just about speed of solution rollout? Does gitlab have plans
to harden DR plans so that you can execute in cases like this? Whenever
someone says they’re “in the cloud” and yet unable to treat servers like
cattle, I get a bit worried.

~~~
thaniri
1. The consensus system you are talking about is Raft
([https://raft.github.io/](https://raft.github.io/)) and it's baked into
Consul.

2. There are two ways to interpret parallel infrastructure. I will post my
thoughts on both.

2a. Standing up a parallel Consul cluster. This is problematic because
typically what people do with Consul is put a Consul agent on every server (or
pod), which registers its services for discovery with the Consul servers. When
you make a parallel Consul cluster you also need to restart the Consul agents
on every other service. They only mention Postgres in this blog post, but
there can potentially be a LOT of other servers registered to the Consul
cluster.

2b. Standing up everything it takes to run Gitlab in parallel and then
diverting traffic. Honestly, sounds great. The reasons a team wouldn't do this
are either a) not having infrastructure code which allows for one-click
deployment of whatever it takes to have Gitlab running, or b) it's actually
pretty expensive to do if you're not Google or Amazon. The blog post mentions
255 clients (and 5 Consul servers). That's a lot of servers to rebuild!

Now, I would love to hear from anyone else who uses Consul, because I have my
own thoughts on how they decided to handle the issue. I will focus my
attention entirely on the Consul portion, not the Postgres portion.

The blog post mentions two limitations:

1) Reloading the configuration of the running service, which worked fine and
did not drop connections, but the certificate settings are not included in the
reloadable settings for our version of Consul.

2) Simultaneous restarts of various services, which worked, but our tools
wouldn't allow us to do that with ALL of the nodes at once.

We don't need to reload. We can run a rolling systemctl restart, which
Ansible is perfect for. The nice thing here is that their stop-gap solution is
to disable TLS verification. This means that servers with TLS verification ON
in the meantime should be able to continue validating certs while other
servers can have a rolling restart that disables TLS verification one server
at a time. If we want to minimize downtime we would do every non-leader server
in the cluster, then finally the leader, then every client in a serial manner.
With 260 servers to deal with it would be slow but it shouldn't break Raft at
any point. There is no reason for quorum to be broken. The gossip will still
be communicated over TLS, just that some of the servers/clients wouldn't be
validating the certs.

Then, we would follow exactly the same process for rolling out valid
certificates with TLS validation turned back on. One non-leader server at a
time, then the leader, then every client.
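
Sketched out (hostnames hypothetical; in practice an Ansible play with
serial: 1 does the same job), the order would be:

    #!/usr/bin/env python3
    # A minimal sketch of the rolling restart described above: followers,
    # then the leader, then the clients, strictly one at a time so Raft
    # quorum is never at risk from more than one server being down.
    import subprocess

    FOLLOWERS = ["consul-02", "consul-03", "consul-04", "consul-05"]  # hypothetical
    LEADER = "consul-01"
    CLIENTS = ["client-%03d" % i for i in range(1, 256)]

    def restart(host: str) -> None:
        subprocess.run(["ssh", host, "sudo systemctl restart consul"], check=True)
        # Don't move on until the agent has rejoined and the cluster looks healthy.
        subprocess.run(["ssh", host, "consul members"], check=True)

    for host in FOLLOWERS + [LEADER] + CLIENTS:
        restart(host)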

I could be missing some critical piece here, and it looks like the Gitlab team
did run a lab test before making their change in prod. It's easy to miss a
possibility when under pressure, and also easy for an online commentator like
myself to think they are so much smarter. They still managed to get out of the
crisis with no downtime and congrats to the operators who pulled it off!

~~~
kilburn
I think what you are missing is that validating servers/clients would not
allow non-validating servers to rejoin the cluster (i.e.: servers/clients with
validation enabled will validate both outgoing _and_ incoming connections).

As I see it, by the time you restart the leader (and hence quorum switches to
the non-validating portion of the cluster) _all_ of your clients will suddenly
fail (they are still validating, and there's no good server for them to
connect to). Conversely, if you restarted the clients first they would all
become unavailable before the quorum switch happened.
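
For reference, these are the Consul agent TLS settings at play here (paths
hypothetical). It's `verify_incoming` that makes a still-validating agent
reject inbound connections from peers whose certs it can no longer validate:

    # Sketch of the relevant agent config, rendered as JSON.
    import json

    validating_agent = {
        "ca_file": "/etc/consul/ca.pem",
        "cert_file": "/etc/consul/agent.pem",
        "key_file": "/etc/consul/agent-key.pem",
        "verify_incoming": True,   # validate certs on connections we accept
        "verify_outgoing": True,   # validate certs on connections we dial
    }
    print(json.dumps(validating_agent, indent=2))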

------
org3432
> After looking everywhere, and asking everyone on the team, we got the
> definitive answer that the CA key we created a year ago for this self-signed
> certificate had been lost.

The GitLab outages always make the company seem disorganized and sloppy, and
unable to reflect on how to improve how they work. So they don't have a
central place to store their CA, and even after an outage, did they improve
anything about how they work?

It's ironic that the post seems geared towards recruiting, though I guess it's
honest: you know what you're getting into with that team.

~~~
gav
I would guess that the root cause of most outages with a human factor is
disorganization and sloppiness, because if that weren't the case there
wouldn't be an outage.

It’s interesting to me that GitLab are so public and honest. I don’t think
that appeals to everyone, but it is a unique selling point to some.

~~~
drewcoo
Being public and honest is always cited when this happens to Gitlab, which I
can say because my fragile memory recalls a number of incidents. This should
be alarming, but apparently their psy ops is better than their dev ops,
because we all react with fondness and awe. Maybe I should do more of _that_
at work!

~~~
vidarh
I think that is because HN has a lot of people who know first-hand that very
few places are free of these kinds of issues.

In 25+ years of working in tech, I can honestly say I've never worked
_anywhere_ where there hasn't been at least one serious issue where part of
the cause was something everyone _knew_ was a bad idea, but that slipped
because of time constraints, or a mistaken belief it'd get fixed before it
came back to bite people.

That's ranged from 5 people startups to 10,000 people companies.

Most of the time, customers _and_ people in the company outside of the
immediate team only get a very sanitized version of what happened, so it's
easy to assume it doesn't happen very often.

Gitlab doesn't seem like the best ever at operating these services, but they
also don't look any worse than average to me, which is in itself an
achievement: most of the best companies in this respect tend to be companies
with more resources that have had a lot more time to run into and fix more
issues. For a company their age, they seem to be doing fairly well to me.

~~~
org3432
So they went off and implemented a fancy new service-discovery tool for what
I'd bet was a problem they didn't have, but couldn't do the basics of tracking
2 kB of data for the CA. I don't think that's an age issue, and there's
nothing that prevents companies of any size from reflecting on what they're
doing and what's important.

Also, what's the point of transparency if you're not getting critical feedback
from it and learning?

------
caleblloyd
> It is maintained by the Infrastructure group, which currently consists of 20
> to 24 engineers (depending on how you count)

20 if you count in Base-12 and 24 if you count in Base-10?

------
kitotik
It blows my mind they didn’t have sane PKI with that many resources. It seems
like even the “small” initial team of a couple devs, a manager, _and a
director_ would’ve at least spun up a vault instance to use as a CA.
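
For illustration, bootstrapping that looks roughly like this against Vault's
PKI secrets engine (address, role name, and TTLs hypothetical; assumes the
engine is already mounted at pki/ and a role is configured):

    # Sketch: use Vault's PKI secrets engine as the internal CA.
    import os
    import requests

    addr = os.environ["VAULT_ADDR"]          # e.g. https://vault.example.com:8200
    headers = {"X-Vault-Token": os.environ["VAULT_TOKEN"]}

    # Generate an internal root CA; Vault holds the key, so it can't be "lost".
    requests.post(
        f"{addr}/v1/pki/root/generate/internal",
        headers=headers,
        json={"common_name": "internal-ca", "ttl": "87600h"},
    ).raise_for_status()

    # Issue a short-lived server cert from a pre-configured role.
    resp = requests.post(
        f"{addr}/v1/pki/issue/consul-server",   # hypothetical role name
        headers=headers,
        json={"common_name": "server.dc1.consul", "ttl": "720h"},
    )
    resp.raise_for_status()
    cert = resp.json()["data"]["certificate"]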

Also, easy for me to say from the peanut gallery, but I don't understand why
they couldn't have done rolling Consul restarts to update the configs; I've
done this many times on Consul clusters.

~~~
mschuster91
> It blows my mind they didn’t have sane PKI with that many resources. It
> seems like even the “small” initial team of a couple devs, a manager, and a
> director would’ve at least spun up a vault instance to use as a CA.

Not mine. In-house CA management is a true PITA; even multiple-thousand-people
companies regularly fuck this up. I have experienced hours of outage because
someone failed to renew the certificate for one of the thousands of pieces
making up a Cisco network environment, and don't get me started on the drama
that is root CA certificate rollover: I've experienced it at three companies
and nowhere was it painless...

~~~
TheCondor
I have seen this more than a couple times, at big places with resources to
manage it. Is it just me or does the TLS and PKI tooling just seem weak? I
keep thinking there should be some badass tool that helps manage this sort of
thing; is there something I don't know about?

~~~
mschuster91
It's not just the tooling that's weak, it's also the terminology and
education. If you're not dabbling in crypto occasionally, half the OpenSSL
manual and 100% of its codebase will read like hieroglyphs... which leads to
most organizations handing the operation of their PKI to the one person who
can successfully get a working HTTPS cert after copypasting shit from Stack
Overflow and wrangling with their certificate vendor's validation tool.

What also really bothers me is that there is no way (assuming I own the
domain example.com) for me to get a certificate that allows me to sign
resources below example.com and that is verifiable by clients, without messing
around with the system root trust store. And then many pieces of software
carry their OWN trust store, totally independent from the OS one (especially
Java; it's a true pain in the ass every two years to update that keystore so
that LDAPS works again)...
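
To be fair, the X.509 scoping mechanism itself exists: the Name Constraints
extension. The unsolved part is exactly the trust-store distribution I'm
complaining about, plus historically spotty client enforcement. A sketch with
Python's cryptography package (names and lifetimes illustrative):

    # Sketch: a self-signed CA that may only sign names under example.com.
    import datetime
    from cryptography import x509
    from cryptography.x509.oid import NameOID
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import ec

    key = ec.generate_private_key(ec.SECP256R1())
    name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "example.com CA")])
    now = datetime.datetime.now(datetime.timezone.utc)

    ca_cert = (
        x509.CertificateBuilder()
        .subject_name(name)
        .issuer_name(name)                       # self-signed
        .public_key(key.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(now)
        .not_valid_after(now + datetime.timedelta(days=3650))
        .add_extension(x509.BasicConstraints(ca=True, path_length=0), critical=True)
        .add_extension(
            # The constraint: only names at or below example.com are allowed.
            x509.NameConstraints(permitted_subtrees=[x509.DNSName("example.com")],
                                 excluded_subtrees=None),
            critical=True,
        )
        .sign(key, hashes.SHA256())
    )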

------
a2tech
If they were going through all this trouble and worry, why not create a new
CA and drop the certs from it onto the hosts? That's the work of just a few
minutes (plus some bash scripting to mass-generate your host certs). If they
had already accepted that they were going to restart the services on all the
hosts anyway, it would have saved them from having to restart everything again
in the future when they need to drop more certs.
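
Something like the following is the few minutes of work I mean (hostnames
hypothetical; it just wraps the stock openssl CLI):

    # Sketch: mint a fresh CA, then mass-generate per-host certs signed by it.
    import subprocess

    def sh(cmd: str) -> None:
        subprocess.run(cmd, shell=True, check=True)

    # New CA -- and this time keep ca.key somewhere it cannot be lost.
    sh("openssl req -x509 -newkey rsa:4096 -nodes -keyout ca.key -out ca.pem"
       " -days 3650 -subj '/CN=new-internal-ca'")

    for host in ["consul-%02d.example.internal" % i for i in range(1, 6)]:
        sh(f"openssl req -newkey rsa:2048 -nodes -keyout {host}.key"
           f" -out {host}.csr -subj '/CN={host}'")
        sh(f"openssl x509 -req -in {host}.csr -CA ca.pem -CAkey ca.key"
           f" -CAcreateserial -days 365 -out {host}.pem")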

~~~
imtringued
Yeah, I was really wondering why they didn't just add a new CA. It would be
really weird if something as critical as Consul were one of the few programs
in the world that only accepts a single root certificate.

------
siscia
Maybe I am saying something stupid, but infrastructure services should be
able to use a dynamic set of keys.

If the first doesn't work, you try the second, and then the third, and so on.

Similarly for the clients: we should be able to dynamically add certificates.

"Our own key expired and our services are all about to drop connections"
seems like something that should not happen.
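
A sketch of what I mean on the client side (paths hypothetical): try each
trusted CA bundle in turn until one validates the peer. In practice you could
load all the bundles into a single trust store; the point is that trust should
be a list you can extend, not a single file:

    import socket
    import ssl

    CA_BUNDLES = ["/etc/pki/ca-current.pem",   # hypothetical paths
                  "/etc/pki/ca-next.pem",
                  "/etc/pki/ca-old.pem"]

    def connect(host: str, port: int) -> ssl.SSLSocket:
        last_error = None
        for bundle in CA_BUNDLES:
            ctx = ssl.create_default_context(cafile=bundle)
            try:
                return ctx.wrap_socket(socket.create_connection((host, port)),
                                       server_hostname=host)
            except ssl.SSLCertVerificationError as err:
                last_error = err        # peer didn't validate; try the next CA
        raise last_error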

------
pronoiac
Consul issues have bitten me at two companies, and I heard word of it being
the culprit for some serious outages elsewhere. One possible takeaway here is
to remove it.

~~~
closeparen
The worst outages will always be those involving core infrastructure.

~~~
yclept
Consul seems to be more prone to issues than one would hope though. Imo the
feature set is not worth the increased complexity and operational burden.
There are simpler ways of handling service discovery and configuration without
running your own consensus based cluster.

~~~
kilburn
Genuine question: can you explain a couple of these simpler ways please?

~~~
closeparen
Central authorities are typically simpler than gossip and consensus systems.
They have failure modes too, of course, but those failure modes are better
understood and potentially easier to manage.

Sometimes you can't avoid the need for distributed consensus, but you can box
it inside a well defined abstraction like leader election, and then do
everything else in a traditional client-server way.
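
For instance, a sketch of that shape (a lease in a central store standing in
for the "central authority"; the key name and timings are hypothetical, and a
real deployment would make the renewal atomic, e.g. with a small Lua script):

    # Lease-based leader election against a central store: the only
    # consensus-shaped problem left is "who holds the lease", and everything
    # else stays plain client-server.
    import time
    import uuid

    import redis

    r = redis.Redis()
    ME = str(uuid.uuid4())
    LEASE_MS = 10_000

    def try_acquire() -> bool:
        # SET NX PX: succeeds only if nobody currently holds the lease.
        return bool(r.set("service/leader", ME, nx=True, px=LEASE_MS))

    def renew() -> bool:
        # Renew only if we still hold the lease (non-atomic check, see above).
        if r.get("service/leader") == ME.encode():
            return bool(r.set("service/leader", ME, xx=True, px=LEASE_MS))
        return False

    while True:
        if try_acquire() or renew():
            pass  # we are the leader: do leader-only work here
        time.sleep(LEASE_MS / 3000)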

