
Learning to operate Kubernetes reliably - mglukhovsky
https://stripe.com/blog/operating-kubernetes
======
KaiserPro
Much as it burns me to admit this, for this use case, Jenkins is king. <60
nodes and it's perfect.

At a previous job, we migrated from a nasty cron orchestration system to
Jenkins. It did a number of things, including building software, batch-generating
thumbnails, and moving data around on roughly 30 nodes, of which about
25 were fungible.

Jenkins Job Builder meant that everything was defined in YAML, stored in git,
and repeatable. A sane user environment meant that we could execute as the
user and inherit their environment. It has sensible retry logic, and lots of
hooks for all your hooking needs. Pipelines are useful for chaining jobs
together.
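
As a rough illustration of that workflow, a Job Builder definition for a cron-style job looks something like the sketch below, applied with the jenkins-jobs CLI; the job name, node label, schedule, script, and addresses are hypothetical, not taken from the setup described above.

    # Hypothetical Jenkins Job Builder definition, kept in git and applied with
    # the jenkins-jobs CLI. Every value here is illustrative.
    - job:
        name: generate-thumbnails          # hypothetical job name
        node: render-workers               # label of the agents allowed to run it
        triggers:
          - timed: 'H 2 * * *'             # nightly, with a hashed start minute
        builders:
          - shell: |
              ./bin/generate_thumbnails --batch
        publishers:
          - email:
              recipients: render-team@example.com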

We _could_ have written them as normal jobs to be run somewhere in the 36k-node
farm, but that was more hassle than it's worth. Sure, it's fun, but we'd have had
to contend with sharing a box that's doing a fluid sim or similar, so we'd
have had to carve off a section anyway.

However, Kubernetes to _just_ run cron is a massive waste. It smacks of
shiny-new-tool syndrome. Seriously, Jenkins is a single-day deployment, and
transplanting the cron jobs is again less than a day (assuming your slaves have
a decent environment).

So, with the greatest of respect, talking about building a business case is
pretty moot when you are effectively wasting what appears to be more than two
man-months on what should be a week-long migration. Think gaffer tape, not
carbon fibre bonded to aluminium.

If, however, the rest of the platform lives on Kubernetes, then I could see
the logic; having all your stuff running on one platform is very appealing,
especially if you have invested time in translating comprehensive monitoring
into business-relevant alerts.

~~~
antoncohen
I disagree that Jenkins is king for this. Jenkins is a single point of
failure; it isn't a highly available distributed scheduler. It is a single
master with slaves. While it is easy to configure Jenkins jobs with code (Job
Builder, Job DSL, Jenkinsfiles), it is a pain to manage Jenkins itself with
code. Plugins, authentication, all the non-job configuration, that is usually
done via the GUI.

Saying Jenkins can be configured in a day, to the degree that Stripe
configured Kubernetes (with Puppet), is disingenuous. It would take more than
a day to do the configuration management of the slaves and get the right
dependencies in place for all the jobs.

How do you isolate job executions in Jenkins? In Kubernetes, each job is
inherently isolated in containers. In Jenkins you have a bunch of choices. Do
you run only one executor per slave? OK, but then you have a bunch of wasted
capacity some of the time, and not enough capacity at other times. You could
dynamically provision EC2 instances to scale capacity, but then you need a
setup to bake your slave AMIs, and you have potentially added ~3 minutes to
jobs for EC2 provisioning. You can run the jobs in Docker containers on the
slaves, which will probably get you better bin packing, but that doesn't have
resource management the way Kubernetes does, so you could easily overload a
slave (leading to failure) while other slaves are underutilized.

Doing Jenkins right is not easy. There are solutions to all of these problems,
but it isn't just "fire it up and it works".

Stripe was running Chronos before, which is a Mesos scheduler. So they have
experience with distributed cluster schedulers. They were probably comfortable
with the idea of Kubernetes.

They mention this as a first step toward using Kubernetes for other things. So
they probably wanted to use Kubernetes for other things, and this seemed like
a low-risk way to get experience with it. Just like GitHub started using
Kubernetes for their internal review-lab to get comfortable with it before
moving to riskier things ([https://githubengineering.com/kubernetes-at-
github/](https://githubengineering.com/kubernetes-at-github/)).

~~~
dominotw
> it is a pain to manage Jenkins itself with code. Plugins, authentication,
> all the non-job configuration, that is usually done via the GUI.

This is not true; all the configuration is scriptable via Groovy scripts. We
run a bunch of Groovy startup scripts that configure everything post-launch.
There is an effort by the Jenkins team to support this better[1].
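
For a sense of what that effort[1] is aiming at, a configuration-as-code file looks roughly like the sketch below; the keys follow the plugin's general format, but the values (admin user, password variable, URL) are hypothetical.

    # Rough sketch of a configuration-as-code (JCasC) YAML file; the values are
    # hypothetical and the available keys depend on the plugins you have installed.
    jenkins:
      systemMessage: "Configured as code, not via the GUI"
      numExecutors: 2
      securityRealm:
        local:
          allowsSignup: false
          users:
            - id: "admin"
              password: "${ADMIN_PASSWORD}"   # injected from the environment
    unclassified:
      location:
        url: "https://jenkins.example.com/"   # hypothetical Jenkins URL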

> How do you isolate job executions in Jenkins? In Kubernetes, each job is
> inherently isolated in containers.

We run one Docker container per build on Docker Swarm. Each build gets its own
isolated, clean environment. There is no EC2 provisioning etc. We already own
and maintain a Docker Swarm setup; we just run Jenkins and the Jenkins agents on
it. I assume that if you are using Kubernetes it would be a similar setup.

> Jenkins is a single point of failure; it isn't a highly available
> distributed scheduler.

I agree with this to an extent. If you are running Jenkins on a scheduler it can
be rescheduled, but your in-flight jobs are dead.

1\. [https://github.com/jenkinsci/configuration-as-code-
plugin](https://github.com/jenkinsci/configuration-as-code-plugin)

~~~
vorg
> > it is a pain to manage Jenkins itself with code

> This is not true; all the configuration is scriptable via Groovy scripts.
> [...] There is an effort by the Jenkins team to support this better[1]

The link you gave confirms it by saying that managing Jenkins with code "require
you know Jenkins internals, and are confident in writing groovy scripts". Neither
GUIs (like the one shown in your link) nor procedural languages (like Apache
Groovy, still procedural even though its collection API is crippled for
Jenkins pipelines) are very good for configuring software. Nor is an
unreadable declarative language (like XML).

A readable declarative language (like YAML, as shown in your link) is the
solution. Languages like Groovy were an over-reaction against unreadable XML
in the Java ecosystem. The correct solution is to switch from an unreadable to
a readable declarative language for configuring software.

~~~
dominotw
> Languages like Groovy were an over-reaction against unreadable XML in the
> Java ecosystem. The correct solution is to switch from an unreadable to a
> readable declarative language for configuring software.

I somewhat agree with you. Unfortunately, the Jenkins team seems to have bet in
the opposite direction by going full Groovy:
[https://github.com/jenkinsci/pipeline-examples](https://github.com/jenkinsci/pipeline-examples)

------
alexebird
I always search for mentions of Hashicorp Nomad in the comments section of
front-page Kubernetes articles like this. There are often few or no mentions,
so I’d like to add a plug for the Hashistack.

For some reason Nomad seems to get noticeably less publicity than some of the
other Hashicorp offerings like Consul, Vault, and Terraform. In my opinion
Nomad is right up there with them. The documentation is excellent. I haven’t
had to fix any upstream issues in about a year of development on two separate
Nomad clusters. Upgrading versions live is straightforward, and I rarely find
myself in a situation where I can’t accomplish something I envisioned because
Nomad is missing a feature. It schedules batch jobs, cron jobs, long running
services, and system services that run on every node. It has a variety of job
drivers outside of Docker.

Nomad, Consul, Vault, and the Consul-aware Fabio load balancer run together to
form most of what one might need for a cluster scheduler based deployment,
somewhat reminiscent of the “do one thing well” Unix philosophy of
composability.

Certainly it isn’t perfect, but I’d recommend it to anyone who is considering
using a cluster scheduler but is apprehensive about the operational complexity
of the more widely discussed options such as Kubernetes.

~~~
akvadrako
I'd never heard of Nomad, but I can't see why I would choose it over the much
more popular and standardised k8s.

The biggest benefits seem to be

(1) simplicity, but GCE and minikube are easy enough to learn in a day, and

(2) the ability to run non-containers, but Docker containers are generic; they
can run Java apps just fine.

~~~
zie
I would argue the biggest strength is maintainability. Managing and keeping up
a distributed k8s cluster is WORK. If you are not at the scale where you
can dedicate full-time staff to managing only k8s, you shouldn't even be
touching k8s. You need full-time staff to keep it alive.

Nomad is operationally simple, you can run it out of your normal devops roles,
you don't need dedicated staff. Mostly because you can pretty easily wrap your
head around what it does and how it works.

This saves you bundles of cash and time.

~~~
akvadrako
I don't see why - I have my GCE cluster running fine with zero maintenance
work.

~~~
zie
Zero maintenance work implies you are not doing security patches or upgrades,
so as soon as you have a problem, not only will you be left holding the
now-broken pieces, but nobody will have any reason to help or support you
unless you pay them $$$$$$'s (and even then... maybe not).

I hope whatever you are running under k8s isn't crucial or important, and I
really hope I'm not a customer of whatever you "operate".

Maintenance is real; that applies to everything if you want it to work
reliably for any length of time. There are various ways to handle maintenance:
do a little consistently and constantly (what most of us professionals do), or
do large bulk replacements every X amount of time (like when stuff crashes and
burns and nobody can remember how to fix it, so they just replace it with
whatever is new and shiny).

~~~
kemitche
GCE is a hosted k8s. Google does the maintenance for you, to my understanding.

~~~
zie
Ah! Sorry, I didn't realize Google had started offering hosted k8s. That
definitely keeps maintenance down, since Google does it for you. It's been a
while since I've dug into k8s in depth.

~~~
ownagefool
Google, Amazon, Microsoft and IBM all offer managed kubernetes.

~~~
zie
Cool. This definitely makes it easier to _use_ k8s, but that's very different
from _running_ k8s. My comments are geared toward running k8s yourself. My
systems are all on physical hardware we own, hence I don't really pay a lot of
attention to the latest and spiffiest in hosted platforms.

------
mephitix
Setting aside the k8s content itself, I love the way this article is written.
It's not a typical tutorial or tips-and-tricks post but takes you time-traveling
through the experience of a big company adopting nascent tech. Lots of great
things to take away even outside of the Kubernetes tips.

~~~
unmole
Julia Evans is something of a celebrity. Her personal blog is an absolute gold
mine: [https://jvns.ca](https://jvns.ca)

------
robszumski
> “Sometimes when we do an etcd failover, the API server starts timing out
> requests until we restart it.”

This is likely related to a set of Kubernetes bugs [1][2] (and grpc[3]) that
CoreOS is working diligently to get fixed. The first of these fixes, the
endpoint reconciler[4], landed in 1.9.

More work is pending on the etcd client in Kubernetes. The good news is that
the client is used everywhere, so one fix and all components will benefit.

[1]:
[https://github.com/kubernetes/community/pull/939](https://github.com/kubernetes/community/pull/939)
[2]:
[https://github.com/kubernetes/kubernetes/issues/22609](https://github.com/kubernetes/kubernetes/issues/22609)
[3]:
[https://github.com/kubernetes/kubernetes/issues/47131](https://github.com/kubernetes/kubernetes/issues/47131)
[4]:
[https://github.com/kubernetes/kubernetes/pull/51698](https://github.com/kubernetes/kubernetes/pull/51698)

~~~
pishpash
I don't get this. Didn't Kubernetes come out of Google's Borg, which had been in
use forever? A second write should be more elegant and impressive -- why so
many basic bugs?

~~~
alpb
Kubernetes takes some concepts from Borg. A system like Borg would be so
closely coupled to Google's infrastructure that there's probably very little
to open source from there without open sourcing the entire machinery.

Also, any large-scale system like Borg developed at a large company like
Facebook or Google will have a completely opinionated one-way-of-doing-things
for a lot of aspects. That doesn't work for the outside world, where there are
lots of developers from different backgrounds and lots of projects with
different requirements.

~~~
obeattie
I think this bit from "Borg, Omega, and Kubernetes"[1] (which is an excellent
read) sheds light on this:

> The Borgmaster is a monolithic component that knows the semantics of every
> API operation. It contains the cluster management logic such as the state
> machines for jobs, tasks, and machines; and it runs the Paxos-based
> replicated storage system used to record the master’s state.

So it sounds as though Borg includes its own storage system. As I understand,
Google has a set of (very complex) libraries written in C++ that implement
Paxos/Multi-Paxos[2], which they have not open sourced.

[1]
[https://research.google.com/pubs/pub44843.html](https://research.google.com/pubs/pub44843.html)
[2]
[https://research.google.com/archive/paxos_made_live.html](https://research.google.com/archive/paxos_made_live.html)

------
scarface74
I'm curious what people think about HashiCorp's Nomad vs. Kubernetes.

I chose Nomad because I'm already using Consul and I wanted to run raw .NET
executables. Would it have been worth it to use Docker with .NET Core?

Not trying to change my infrastructure now, but just curious about whether it
is worth the time to play with it on the side.

~~~
wmf
Nomad appears to be better designed, more scalable, and easier to operate than
k8s, but it will fall behind pretty rapidly since k8s has 100x more
developers.

~~~
pm90
That isn't necessarily true (playing devil's advocate): OpenStack had a
gajillion developers and still failed (mostly).

Although k8s does seem to be designed much better. I use it personally too and
hope for its success.

------
YesThatTom2
Such good writing style AND useful technical content. Why can't all blog posts
be this good?

~~~
nindalf
The author writes regularly and her posts almost always reach the top of HN.
Like most skills, improvement comes with practice. If a person is willing to
put in the same time and effort as jvns has, I'm sure they would be rewarded
with similar results.

------
djsumdog
I haven't been at a k8s shop yet, but at my last job we used Marathon (on
DC/OS). I know you can run Kubernetes on DC/OS, but the default scheduler it
comes with is Marathon.

Is there an advantage to one over the other? It looks like in both cases you
need a platform team (at least 2, maybe 3 people; we had a large, complex setup
and had something like 10) to set up things like k8s, DC/OS, or Nomad, because
they are complex systems with a lot of different components: components like
Flannel vs. Weave Net vs. some other container network, handling storage
volumes, labels, and the automatic configuration of HAProxy from them
(marathon-lb on DC/OS).

All of the schedulers (k8s, Swarm, Marathon) seem to use a JSON format for job
information that's pretty specific, not only to the scheduler, but to the way
other tooling is set up at your specific shop.

------
perfmode
Why do you need a 99.99% job completion rate? Why not just design for
failure and inevitable retries? It almost seems like you grant platform users a
false sense of security by making it very reliable but not perfect.

~~~
sisk
My guess: because financial systems.

A lot of traditional financial instruments 1) are not resilient to failure and
2) run at fixed times in batches. I’m confident it’s not their own systems
that set the requirement of rigidity.

------
ad_hominem
How do you deal with sidecar containers in CronJobs (and regular batch Jobs)
not terminating correctly?

[https://github.com/kubernetes/kubernetes/issues/25908](https://github.com/kubernetes/kubernetes/issues/25908)

~~~
jvns
We don't run sidecar containers in cron jobs yet. That said, here's a
workaround (from that issue):
[https://github.com/kubernetes/kubernetes/issues/25908#issuec...](https://github.com/kubernetes/kubernetes/issues/25908#issuecomment-308569672)
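
Roughly, one common shape for that kind of workaround (a sketch with hypothetical names, images, and paths, not necessarily the exact one linked above) is a shared emptyDir volume plus a sentinel file: the main container touches the file when it finishes, and the sidecar polls for it and exits. The pod spec inside the job template would look something like:

    # Hedged sketch of a sentinel-file workaround for sidecars in Jobs/CronJobs.
    # This is the pod spec that would sit inside the job template; names, images,
    # and paths are hypothetical.
    spec:
      restartPolicy: Never
      volumes:
        - name: lifecycle
          emptyDir: {}
      containers:
        - name: main
          image: example.com/report:latest
          command: ["/bin/sh", "-c"]
          # Touch the sentinel file on exit, whatever the outcome.
          args: ["trap 'touch /lifecycle/done' EXIT; /app/run-report"]
          volumeMounts:
            - {name: lifecycle, mountPath: /lifecycle}
        - name: proxy
          image: example.com/db-proxy:latest
          command: ["/bin/sh", "-c"]
          # Run the proxy in the background and stop it once the main container
          # has finished and written the sentinel file.
          args:
            - |
              /proxy & pid=$!
              while [ ! -f /lifecycle/done ]; do sleep 1; done
              kill "$pid"
          volumeMounts:
            - {name: lifecycle, mountPath: /lifecycle}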

~~~
ad_hominem
I'm aware of the workarounds in that thread. Just wondering if Stripe had a
different workaround but I guess not.

~~~
jmillikin
That GitHub comment is Stripe's workaround! I copied it nearly as-is from our
internal job setup boilerplate.

------
asimpletune
What is the benefit of using Kubernetes over Mesos (or in conjunction with
Mesos)?

~~~
vicaya
FTFA: "We’d previously been using Chronos (with Mesos) as a cron job
scheduling system, but it was no longer meeting our reliability requirements
and it’s mostly unmaintained (1 commit in the last 9 months, and the last time
a pull request was merged was March 2016) Because Chronos is unmaintained, we
decided it wasn’t worth continuing to invest in improving our existing
cluster."

Though Chronos recently had a release with a bunch of fixes, Mesos is
inevitably fading as a legacy platform.

~~~
asimpletune
> Mesos is inevitably fading as a legacy platform.

Because of Chronos? This is a bizarre thing to say. Mesos actually works
extremely well. Whenever I ask the "why Kube over Mesos" question, I never get a
good answer. I think it's because people just don't know Mesos. Also, it wasn't
made by Google.

~~~
vicaya
Chronos is just an example. There are many bugs in Mesos that don't get fixed
for months or years. Mesos core is legacy (pre-C++11) C++ code that nobody wants
to maintain.

~~~
pm90
> Mesos core is legacy (pre-C++11) C++ code that nobody wants to maintain.

This is actually very, very, VERY important. Go is a lot more concise (IMO) than
C++; generally, when I'm curious about how something works in a project written
in Go, it's much easier to follow the logic.

~~~
asimpletune
Go is nice. I like it a lot. It's very readable, it reduces the number of good
ways you can do something to usually just one, it's fast, fat binaries are
awesome, great concurrency primitives, etc...

------
minimaxir
Kubernetes very recently added native CronJob support:
[https://kubernetes.io/docs/concepts/workloads/controllers/cr...](https://kubernetes.io/docs/concepts/workloads/controllers/cron-
jobs/)
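
For reference, a minimal CronJob manifest looks roughly like the sketch below (batch/v1beta1 was the current API version around the time of this thread; the name, schedule, and image are made up):

    # Rough sketch of a native Kubernetes CronJob; all names are hypothetical.
    apiVersion: batch/v1beta1
    kind: CronJob
    metadata:
      name: nightly-report
    spec:
      schedule: "0 4 * * *"
      concurrencyPolicy: Forbid        # don't start a new run while one is still going
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: OnFailure
              containers:
                - name: report
                  image: example.com/nightly-report:latest
                  command: ["/app/run-report"]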

How does Stripe's approach differ?

~~~
tarmstrong
No difference — we are using Kubernetes's native cronjob support. This post is
about how we migrated to that system.

