
The Horrors of Upgrading Etcd Beneath Kubernetes - twakefield
https://gravitational.com/blog/kubernetes-and-offline-etcd-upgrades?
======
philips
Hey, thanks for the article. I know etcd upgrades can look complex, but
upgrading distributed databases live is always going to be quite non-trivial.

That said, for many people taking some downtime on their Kube API server isn't
the end of the world. The system, by design, can work OK for some time in a
degraded state: workloads keep running.

A few things that I do want to try to clarify:

1) The strict documented upgrade path for etcd exists because test matrices
just get too complicated. It isn't so much a technical limitation as a desire
to ensure recommendations are based only on configurations that have actually
been tested. The documentation is all here:
[https://github.com/coreos/etcd/tree/master/Documentation/upg...](https://github.com/coreos/etcd/tree/master/Documentation/upgrades)

2) Live etcd v2 API -> etcd v3 API migration for Kubernetes was never a
priority for the etcd team at CoreOS because we never shipped a supported
product that used Kube + etcd v2. Several community members volunteered to
make it better, but it never really came together. We feel bad about the mess,
but it is a consequence of not having that itch to scratch, as they say.

3) Several contributors, notably Joe Betz of Google, have been working to keep
older minor versions of etcd patched. For example, 3.1.17 was released 30 days
ago, and the first release of 3.1 was 1.5 years ago. These longer-lived
branches are intended to be bug-fix only.

~~~
geertj
> upgrading distributed databases live is always going to be quite non-trivial.

I understand that the implementation of live upgrades for a distributed
database will be complex, but this post is about the user experience. Given
enough resources, is there a reason it can't be a single "upgrade now"
command? Or, maybe slightly more real-world, a 3-step process like: "stage
update" -> "test update" -> "start update".

~~~
philips
That is the process:
[https://github.com/coreos/etcd/blob/master/Documentation/upg...](https://github.com/coreos/etcd/blob/master/Documentation/upgrades/upgrade_3_1.md#upgrade-procedure)

    
    
      for member in etcd-0 etcd-1 etcd-2; do             # hypothetical member hosts
        ssh "$member" 'systemctl stop etcd'              # stop the member (systemd assumed)
        ssh "$member" 'cp /tmp/etcd /usr/local/bin/etcd' # replace etcd binary
        ssh "$member" 'systemctl start etcd'             # restart etcd process
      done

------
peterwwillis
I don't know about you, but my application is tested on a single
platform/stack with a specific set of operations. When the behavior of the
thing I'm running on changes, my application has effectively changed. It just
can't be expected to run the same way. An upgrade means your app is going to
work differently.

Not only is the app now different, but the upgrade itself is going to be
dangerous. The idea that you can just "upgrade a running cluster" is a bit
like saying you can "perform maintenance on a rolling car". It is physically
possible. It is also a terrible idea.

You can do some maintenance while the car is running. Mainly things that are
inside the car that don't affect its operation or safety. But if you want to
make significant changes, you should probably stop the thing in order to make
the change. If you're in the 1994 film _Speed_ and you literally can't stop
the vehicle, you do the next best thing: get another bus running alongside the
first bus and move people over. Just, uh, be careful of flat tires.
([https://www.youtube.com/watch?v=rxQI2vBCDHo](https://www.youtube.com/watch?v=rxQI2vBCDHo))

~~~
taeric
Picking a dependency that does not have a live method of updating is a
terrible idea for a software system.

Physical systems, like a car, have obvious limitations on what can be modified
and when. Similarly, software will have some limitations on what happens while
you are updating. But accepting "upgrades can't be done easily" for software
places far more limitations on the software than makes sense.

~~~
peterwwillis
With the exception of like, binary patching of executables that use versioned
symbols or some craziness like that, virtually all software cannot be upgraded
while it is running _and expected not to produce errors_.

I mean, if you use a plug-in style system, you can program it to block
operations while a module is reloaded or something. But most software is not
designed this way. Especially with ancient monolithic models like Go programs.

Upgrades just can't be done easily in a complex system. You can do them
without concern for their consequences, but that doesn't mean they're safe or
reliable methods.

------
amorousf00p
I'm old school. I look at containers as jails, and I see all the work to
isolate applications in containers as of questionable value, given that a flat
process space with MAC and application resource controls already serves
well-designed applications.

That is, I default to good design and testing rather than boilerplate
orchestration and external control planes.

All containers have done (popularly), in my opinion, is add complexity and
insecurity to the OS environment and encourage bad behavior in software
development and systems administration.

------
mirceal
This may be an unpopular opinion, but I’m not a big fan of containers and K8S.

If your app needs a container to run properly, it’s already a mess.

While what K8s has done for containers is freaking impressive, to me it does
not make a lot of sense unless you run your own bare-metal servers. Even then,
the complexity it adds may not be worth it. Did I mention that the tech is not
mature enough to just run on autopilot, and that now, instead of worrying
about the “devops” for your app/service, you are playing catch-up with
upgrading your K8s cluster?

If you’re in the cloud, VMs + autoscaling or fully managed services (e.g. S3,
Lambda, etc.) make more sense and allow you to focus on your app. Yes, there
is lock-in. Yes, if not properly architected it can be a mess.

I wish we lived in a world where people picked simple over complex and thought
long-term instead of chasing the latest hotness.

~~~
outworlder
> This may be an unpopular opinion, but I’m not a big fan of containers and
> K8S.

It is unpopular for a reason.

Disclaimer: if you can run solely on cloud managed services + serverless,
please do that and do not even look at the rest of this message. It is a very
nice approach, although there are some things you need to set up before
declaring victory (a deployment pipeline is one). And, as you mentioned, there
is vendor lock-in.

Now, containers. Look, no-one WANTS containers. Or VMs. Or anything else. We
just want to run our stuff. It just so happens that containers are one of the
most useful abstractions we have. Unless someone comes up with a new
abstraction, containers it is.

Because a container is, at the end of the day, a process. Are you against
processes? Or against process isolation in general?

You cannot lift an existing service and run it serverless; you need to modify
it. In many cases that is not practical. In other cases, the service needs to
be an actual server-like application and hold a connection. Lambda doesn't
help there.

> While what K8s has done for containers is freaking impressive, to me it does
> not make a lot of sense unless you run your own bare metal servers.

One thing has nothing to do with the other. These are different levels of
abstraction, and there are challenges when running bare-metal servers which
are not present in cloud environments. Kubernetes can do so much more in a
cloud environment (persistent volume claims, etc). Rolling out network-
attached storage on bare-metal servers is a pain. It is also a pain with
OpenStack, but at least there is a standard interface there.

> If you’re in the cloud, VMs + autoscalling or fully managed services (eg S3,
> lambda, etc) make more sense and allow you to focus on your app.

Sorry, I respectfully disagree. I have spent the last two months implementing
automation for deploying a cluster on AWS, with auto-scaling, auto-healing,
the works, automatically deployed through Jenkins. It is NOT easy, it is not
simple, and it is not focusing on my end application, _unless_ you ignore all
the technical debt you are incurring. And we DO have several k8s clusters; I
will be moving that crap to k8s as soon as I can.

Let me make a quick list of what you need:

Prep:

You can use a barebones (ubuntu|redhat|coreos|etc) VM, in which case
provisioning is not complete once the ASG brings the VM up: you still need to
install the app. If you use an AMI, you now need to build automation to
construct these AMIs. Note: if at this stage you are building AMIs by hand,
that is technical debt, which you will have to pay. Alternatively, you can use
something like cloud-init. Either way, see below:

* Create the auto-scaling group
* Create the launch configuration
* If your AMI is not entirely complete, add user data (or equivalent) here
* Set the health checks

And you are done! Right? No.

What about log rotation? Do you have centralized logging? No? Tech debt. Go
set it up. What about monitoring? Are you using CloudWatch? Prometheus? Go set
that up. What about alerting? Not everything requires a VM to be destroyed, so
you need to set that up too. What about upgrades? Are these cattle servers?
Then you have to modify your AMI and launch config. Go automate this (tech
debt if not). How are you controlling access? Do you have a team? Are they
allowed to SSH? Where and how are you storing the keys? How do you invalidate
a key if it gets compromised? I could go on, but let's keep it at this level,
because the point is to draw a comparison.

With K8s, here's what you do:

Create a container image. Dockerfile, fancy Jenkins script, some other
mechanism, I don't care. Create an image, put it in a registry somewhere.
Create a YAML file describing your 'deployment'. It can be a few lines if you
don't care about most of the options. If you need external access, you can
create a service, which is another YAML. If you don't have an existing HTTPS
load balancer, point one at the k8s workers (trivial with something like
ingress on GKE).
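
To make that concrete, a rough sketch; the app name, image, and ports here
are all made up:

    # hypothetical deployment + service; swap in your own name/image/ports
    kubectl apply -f - <<'EOF'
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp
    spec:
      replicas: 3
      selector:
        matchLabels: {app: myapp}
      template:
        metadata:
          labels: {app: myapp}
        spec:
          containers:
          - name: myapp
            image: registry.example.com/myapp:1.0.0
            ports:
            - containerPort: 8080
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: myapp
    spec:
      selector: {app: myapp}
      ports:
      - port: 80
        targetPort: 8080
    EOF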

And you are done. This automatically gets you:

• Self-healing
• Scheduling among worker nodes. You can control it or let K8s decide
• Bin-packing
• Logging (centralized logging requires a one-time step, with fluentd or
similar, and may be handled by cloud providers)
• Similarly, monitoring and alerting require a one-time investment in
deploying something like Prometheus; after that is done, getting Prometheus to
scrape your pods is very easy, easier than deploying on a VM-by-VM basis
• Upgrades: deployments handle that for you. Even with replica sets before
them, you just needed to apply a new YAML with an updated version
• There are no SSH keys to mess around with. K8s has certificate-based user
access control, with optional RBAC
• The SSH equivalent is kubectl exec
• Service discovery: you have DNS records created for you for all local
services. The cluster will direct you to the correct node
• Scaling is trivial, but most importantly, quick: kubectl scale deployment
<name> --replicas=X (sketched below). It only takes whatever time is required
to download the image and run it. You don't have to spin up a whole operating
system
• Optional: you can have horizontal pod auto-scaling, so your services can
scale up and down automatically.
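
For instance, scaling out and rolling a new version are each a single command
(the deployment name and image tag below are hypothetical):

    kubectl scale deployment myapp --replicas=5
    kubectl set image deployment/myapp myapp=registry.example.com/myapp:1.0.1
    kubectl rollout status deployment/myapp   # watch the rolling update happen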

It is not perfect, but it can be a game changer. I cannot imagine how we would
be running our operation without K8s. Actually, I can: version 1.0 of the app
was a bunch of VMs, one for each service. It was nightmarish. Now the push,
company-wide, is to move everything to K8s. All VMs, all data stores, all of
it. And it has absolutely nothing to do with hype; it has everything to do
with proven advantages compared to most other alternatives.

I guess you could also do Mesos. They have a similar concept, only it's not
K8s.

Note that SOMETHING needs to run the K8s cluster itself. That something is
precisely your auto-scaling groups and VM images. It is less painful with a
container-optimized OS (like CoreOS or whatever Google uses).

~~~
mirceal
A container is not a process. The fact that you say this makes me wonder if
you understand what K8s is and how it works.

K8s is not going to solve the issues you outline above (logs, proper
monitoring, etc). Even worse, you’re gonna have a bad time migrating them to a
proper solution.

~~~
vel0city
While k8s does not in and of itself solve some of the issues pointed out
above, it does centralize a lot of these problems, and the centralized
problems are then often easily addressed by the cloud providers actually
running the k8s cluster. GKE automatically sends log data and metrics through
fluentd to their Stackdriver platform, including error alerting. If something
prints anything that looks like a stack trace to stdout or stderr,
Stackdriver sends alerts and creates an issue for us.

------
djb_hackernews
The clustering story for etcd is pretty lacking in general. The discovery
mechanisms are not built for cattle-type infrastructure or public clouds: it
is difficult to bootstrap a cluster on a public cloud without first knowing
the network interfaces your nodes will have, unless you already have an etcd
cluster or use SRV records (see the sketch below). From my experience, etcd
makes it hard to use auto-scaling groups for healing and rolling updates.
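
For reference, the SRV route looks roughly like this; the domain, node name,
and token are placeholders:

    # requires _etcd-server._tcp.example.com SRV records for the peers
    etcd --name node1 \
      --discovery-srv example.com \
      --initial-advertise-peer-urls http://node1.example.com:2380 \
      --initial-cluster-token my-cluster-token \
      --initial-cluster-state new

...which still assumes you control DNS and know the node names up front.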

From my experience consul seems to have a better clustering story but I'd be
curious why etcd won out over other technologies as the k8s datastore of
choice.

~~~
nvarsj
> From my experience consul seems to have a better clustering story but I'd be
> curious why etcd won out over other technologies as the k8s datastore of
> choice.

That'd be some interesting history. That choice had a big impact in making
etcd relevant, I think. As far as I know, etcd was chosen before Kubernetes
ever went public, pre-2014? So it must have been really bleeding-edge at the
time. I don't think Consul was even out then; it might just have been too late
to the game. The only other reasonable option was probably ZooKeeper.

~~~
robszumski
I was around at CoreOS before Kubernetes existed. I don't recall exactly when
etcd was chosen as the data store, but the Google team valued focus for this
very important part of the system.

etcd didn't have an embedded DNS server, etc. Of course, these things can be
built on top of etcd easily. Upstream has taken advantage of this by swapping
the DNS server used in Kubernetes twice, IIRC.

Contrast this with Consul, which contains a DNS server and is now moving into
service mesh territory. This isn't a fault of Consul at all, just a desire to
be a full solution vs a building block.

~~~
otterley
My understanding is that Google valued the fact that etcd was willing to
support gRPC and Consul wasn't -- i.e., raw performance/latency was the gating
factor. etcd was historically far less stable and less well documented than
Consul, even though Consul had more functionality. etcd may have caught up in
the last couple years, though.

~~~
smarterclayton
At the time gRPC was not part of etcd - that only arrived in etcd 3.x.

The design of etcd 3.x was heavily influenced by the Kube use case, but the
original value of etcd was that

A) you could actually do a reasonably cheap HA story (vs singleton DBs)

B) the clustering fundamentals were sound (zookeeper at the time was not able
to do dynamic reconfiguration, although in practice this hasn’t been a big
issue)

C) consul came with a lot of baggage that we wanted to do differently - not to
knock consul, it just overlapped with alternate design decisions (like a large
local agent instead of a set of lightweight agents)

D) etcd was the simplest possible option that also supported efficient watch

While I wasn’t part of the pre-open-sourcing discussions, I agreed with the
initial rationale and I don’t regret the choice.

The etcd2-to-3 migration was more painful than it could have been, but I think
most of the challenges were exacerbated by us not pulling the band-aid off
early and forcing a 2-to-3 migration for all users right after 1.6.

------
akeck
To sidestep upgrade issues, we're pursuing stateless, immutable K8S clusters
as much as possible. If we need a new K8S, etcd, etc., we'll spin up a new
cluster and move the apps. Data at rest (prod DBs, prod POSIX FS, prod object
stores, etc.) is outside the clusters.

~~~
SteveNuts
Where do you run your persistent apps?

~~~
brianwawok
Not the OP, but I have two kinds of persistent data.

1) Images / files / etc. It all lives in cloud storage ("s3"), outside of K8s.

2) RDBMS data. You can run a hosted SQL service (say, CloudSQL) or a
not-in-k8s VM. I have found no compelling reason to move my RDBMS into my k8s
cluster.

~~~
zbentley
That's a bit distressing. Most everywhere I've worked, the infrastructure that
matters to the business has fallen into roughly two categories:

Category 1: stateless-"ish" workloads. More than 90% of hosts/containers
used... less than 25% of operations headaches and time. Issues that happen
here are solvable with narrow solutions: add caches, scale out, do very
targeted, transparent fixes to poorly-performing application code.

Category 2: stateful workloads. Less than 10% of hosts/containers. 75% or more
of operations headaches and time. Issues that happen here have less
visibility, fewer short-term fixes ("just add an index and turn off the bad
queries" only works so many times before you're out of low-hanging fruit), and
require more expertise to solve in a way that doesn't require the
application/clients to change.

If k8s and other next-gen technologies are only easing the first category,
that makes me sad. It's like we have this sedan (off-the-shelf web
technologies) that we have to take off-roading and it falls apart all the
time. I don't want a better air conditioning system and more cushions in my
seats; I want the vehicle to not break.

~~~
wereHamster
In k8s you can easily host stateful services. You have persistent disks that
you can attach to containers, and you also have StatefulSets for stateful
services that you want scaled automatically
([https://kubernetes.io/docs/concepts/workloads/controllers/st...](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/)).
You can use both to run a database (Postgres, for example).
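
A minimal sketch of what that can look like; the name, image, and storage
size are illustrative only:

    # hypothetical single-replica Postgres backed by a per-pod persistent volume
    kubectl apply -f - <<'EOF'
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: postgres
    spec:
      serviceName: postgres
      replicas: 1
      selector:
        matchLabels: {app: postgres}
      template:
        metadata:
          labels: {app: postgres}
        spec:
          containers:
          - name: postgres
            image: postgres:10
            volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    EOF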

~~~
zaat
I thought that this was the case too, but OP had a link
([https://gravitational.com/blog/running-postgresql-on-kuberne...](https://gravitational.com/blog/running-postgresql-on-kubernetes/))
to a previous post that got me worried.

~~~
outworlder
From TFA

> Kubernetes is not aware of the deployment details of Postgres. A naive
> deployment could lead to complete data loss.

That sounds ominous, but is actually a tautology.

You have the exact same challenges anywhere else, but since K8s makes some
operations so easy to do, you need to be careful. RDBMSes are especially
tricky because most of them expect a single "master" which holds special
status. And it so happens to hold all your data too (as do your replicas,
provided they are up to date).

------
outworlder
This article really hits home.

A K8s cluster can survive just about anything. Worker nodes destroyed? Meh,
the scheduler will take care of bringing stuff back up. Master nodes
destroyed? Meh. It doesn't care.

etcd issues? Prepare for a whole lot of pain. Fortunately they are very
uncommon; upgrading is the most frequent operation that surfaces them.

------
segmondy
I'll have to read this later this weekend; my home k8s cluster broke, and it
was because of etcd. Grrr

------
fen4o
From my experience, running etcd in cluster mode simply creates too many
problems. It can scale vertically very well, and if you run etcd (and the
other Kubernetes control-plane components) on top of Kubernetes, you can get
away with running only a single instance.

------
jacques_chester
Etcd misbehaving during upgrades or when a VM was replaced was a _massive_
source of bugs for Cloud Foundry.

There is no longer an etcd anywhere in Cloud Foundry.

~~~
ec109685
Aren’t they introducing Kubernetes as part of Cloud Foundry?
[https://techcrunch.com/2018/04/20/kubernetes-and-cloud-found...](https://techcrunch.com/2018/04/20/kubernetes-and-cloud-foundry-grow-closer/)

~~~
jacques_chester
Sort of. Kinda. It's complicated. But insofar as Kubernetes becomes the
container orchestrator, I imagine we will encounter some or all of those
problems again.

Cloud Foundry's current orchestrator, Diego, is of similar vintage to
Kubernetes. It now relies entirely on relational databases for tracking
cluster state, as do other subsystems (e.g., Loggregator). It scales just
fine. MySQL, while not my personal favourite, has proved more reliable in
practice than etcd. Some folks use PostgreSQL, which is also more reliable in
practice.

Paying customers care more about being more reliable in practice than being
more reliable in theory.

I've ha-ha-only-seriously suggested we throw engineering support behind non-
etcd cluster state. For example:
[https://github.com/rancher/k8s-sql](https://github.com/rancher/k8s-sql)

------
a2tech
What problems exactly is etcd trying to solve?

~~~
aberoham
Interviewer here -- "If you want to form a multi-node cluster, you pass it a
little bit of configuration and off you go." ... "Raft [etcd] gets you past
this old single-node database mantra, which isn’t really particularly old or
even a wrong way to go. Raft gives you a new generation of systems where, if
you have any type of hardware problem, any type of software instability
problem, memory leaks, crashes, etc., another node is ready to take over that
workload because you are reliably replicating your entire state to multiple
servers."

------
king_nothing
Lol. CoreOS and Hashicorp products often throw “cloud” and “discoverability”
around but lack crucial features for ops supportability found in solutions
that came before. Zookeeper, Cassandra, Couchbase didn’t evolve in a
development vacuum chamber. New != better.

