
What we learned after a year on Kubernetes - ecliptik
https://about.gitlab.com/blog/2020/09/16/year-of-kubernetes/
======
t3rabytes
> Increased billing from cross-AZ traffic

Yep, we (Basecamp) have been bitten by this with EKS too. By default the ALB
Ingress Controller will put all cluster nodes into the target group, and a
request can hit kube-proxy in any AZ before being directed to the right place,
causing a lot of inter-AZ traffic churn. It's mildly annoying.

(With alb-ingress-controller you can change target-type to IP to have traffic
go directly to the pods that actually back a given service instead of via
kube-proxy on a random node, but the big fault there is that if alb-ingress-
controller gets hung up on something and you deploy, you can send traffic to
nowhere because the target group still has old IPs for pods that no longer
exist.)
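For reference, the switch described here is a single annotation on the Ingress. A sketch of the relevant manifest fragment (names are illustrative, and older alb-ingress-controller releases may differ in details):

```yaml
# Fragment of an Ingress manifest (illustrative):
metadata:
  annotations:
    kubernetes.io/ingress.class: alb
    # "ip" registers pod IPs directly in the ALB target group; the default
    # "instance" mode registers every node and relies on kube-proxy routing.
    alb.ingress.kubernetes.io/target-type: ip
```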

~~~
GauntletWizard
It's so worth it, though. You may not realize it, but your health checks don't
work unless you are using target-type IP. In the standard mode, AWS
health-checks each Kubernetes node, but not your pods themselves. It will keep
sending traffic to pods that are unhealthy (such as in state: Terminating) if
it already has connections open through kube-proxy to them, and stop sending
traffic to healthy pods if the connection to them was through a node it
considers "unhealthy". I wrote a tool[1] to handle this at Houseparty before
alb-ingress-controller supported this mode, and it gained us a whole nine as
measured by the ALB.

The same failure mode actually exists with instances, though they typically
rotate much more slowly. Monitor your ALB-ingress-controller and treat it as
critical infrastructure, because it is.

[1][https://github.com/GauntletWizard/targetgroupcontroller](https://github.com/GauntletWizard/targetgroupcontroller)

~~~
rsanders
Can you expand on this: "AWS healthchecks each kubernetes node, but not your
pods themselves".

Are you talking about a keepalive connection to an unhealthy pod which is
reused for multiple requests? If I understand you correctly, the failure modes
are: a) the ALB keeps sending requests through an established keep-alive HTTP
connection which terminates in an unhealthy pod, but which it sees as healthy
because the node is healthy and can route traffic to another, healthy pod; and
b) the health of an established HTTP keepalive connection is perceived to be
that of the node rather than the destination pod, so nodes which become
unhealthy can cause the ALB to unnecessarily terminate a keepalive connection.

We had to switch to using target-type=instance because of issues with pods not
being deregistered. I'd prefer to use target-type IP but it seemed like
preventing 500s on rollouts required a bit of testing and tuning with a very
specific approach. e.g. introducing a longish delay on pod termination with a
lifecycle hook and using the pod readiness gate support recently added to alb-
ingress-controller.

~~~
GauntletWizard
You've got it exactly right. Your problem of pods not being deregistered is a
real problem, but one with a quick fix: the default "Deregistration delay"
for ALBs is 300 seconds, but for Kubernetes pods the TerminationGracePeriod
defaults to 60 seconds. This means that your load balancer keeps trying that
pod for four whole minutes after it's been hard-shutdown.

Here's the annotation that I used to fix that:

    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30,slow_start.duration_seconds=30

------
pjmlp
I learned not to use it and wait until the k8s fad goes away.

~~~
Keycap
Why would you think it would go away?

It's literally the best option currently for what it does.

It has unprecedented support behind it as well.

Multiple vendors support it through certified managed k8s services.

It solves real problems out of the box, like load balancing, ingress, cert
management, autoscaling, health checks, auto-repair.

It allows for simple IaC.

Is it young? Yes. Do we need more people with more experience? Yes.

Is this a problem? No.

~~~
thu2111
I dunno, because of writeups like GitLab's? We're lucky GitLab is so
transparent. It lets us see that their claims don't always match reality.

Here's an example.

Blog post says: _" After transitioning each service, we enjoyed many benefits
of using Kubernetes in production, including much faster and safer deploys of
the application, scaling, and more efficient resource allocation"_

Wrong. Actual analysis by the engineers of one of their migrated jobs (urgent-
other sidekiq shard) says their cost went from $290/month when running in VMs
to $700/month when running in Kubernetes. They tried to use auto-scaling but
failed completely and ended up disabling it:

[https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/920#shard-performance](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/920#shard-performance)

Kubernetes looks like a massive LOSS in the case of this service. The
perceived scaling benefits didn't materialise at all, and their costs more
than doubled. They also spent a lot of engineering time optimising startup
time of this job: 7 PRs and complex writeups/testing were required. Just so
they could try to auto-scale a job to reduce the hit of the hugely multiplied
base costs. If not for Kubernetes that eng time might have been spent adding
features to the product instead.

 _It solves real problems out of the box like load balancing, ingress, cert
management, autoscaling, health checks, auto-repair_

Most of these problems are created by Kubernetes. "Ingress" is a term only
Kubernetes and low-level networking use. "Cert management" often isn't
required in more traditional setups beyond provisioning web servers with the
SSL files. "Health checks" are a feature of any load balancer, and the whole
point of VMs is to avoid the need to repair hardware. Finally, on-demand
auto-scaling as seen in this case is (a) possible without Kubernetes and (b)
not actually working well anyway.

Frankly this set of bugs, writeups, PRs etc is quite scary. I worked with Borg
at Google and saw how it was both a force multiplier but also a huge timesink
in many ways. It made some complex things simple but also made some simple
things absurdly complex. At one point I had a task to just serve a static
website on Borg, it turned into a three month nightmare because the
infrastructure was so fragile and Google-specific. A bunch of Linux VMs
running Apache would have been far faster to set up and run.

~~~
Keycap
It's really weird that your argument is based on cost alone.

I mean, you mentioned yourself you worked at Google, right?

Perhaps you just haven't actually experienced the issues Kubernetes is
solving?

How often have you seen certificates expire? I have seen that. Plenty of
times. It's not hard to create that Let's Encrypt cron job, but it's still
something you need to get right.

Security updates? Have you seen how many companies run old, unpatched VMs?

Disk full due to logs? Yes, I see this regularly.

Memory leak on a service and someone needs to restart it manually until
someone else fixes the issue? Yes!

Requesting a VM, hardware, infrastructure, getting it, and the whole lifecycle
management of it on the backend? It's real.

Ansible scripts, Puppet, or just bash scripts plus a Word document telling you
how this magic machine was set up? Yep.

Kubernetes solves all those problems.

Your static website on Borg, if it still runs, probably still has a valid
certificate, is running on secure infrastructure, is configured identically on
every instance instead of being weird on 1 of 6 servers, and just runs.

A smart person taking responsibility for all of this costs you much more than
just a few hundred bucks a month. And you need that person. With Kubernetes,
this person can manage and operate many more servers, more easily and more
securely, than if they were VMs.

And in my personal experience: that shit runs more stably, because that shit
can restart and be recreated, and that alone solves a handful of shitty memory
and disk-full issues.

~~~
thu2111
I've been running my own Linux servers for about twenty years, so I'm not
entirely inexperienced with these things.

Borg is/was great for running huge numbers of services at truly massive scale,
when those services were developed _entirely_ in house and done in the exact
way it wanted services to be done. It was a terrible cost the moment you
wanted to run anything third party or which wasn't written in that exact
Google way, and the costs were especially high if you didn't need to handle
huge traffic or data sizes.

Borg and Kubernetes don't auto-magically do sysadmin work. A ton of people
work in infrastructure at Google. Software updates still need to be applied
etc. In Kubernetes that means rebuilding Docker images (which in reality
doesn't happen as the tools for this are poor, so you just have a lot of
downlevel Ubuntu images floating around).

And worse, the whole K8s/Docker paradigm is totally backwards incompatible. At
Google this didn't matter because the software stack evolved in parallel with
the cluster management. Programs there expected local disk to be volatile,
expected to be killed at a moment's notice, expected datacenters to appear and
disappear like moles. They were written that way from scratch. But that came
with a terrible price: it was basically impossible to use ordinary open source
apps. You could import libraries and (carefully!) incorporate them into Borg-
ized projects, but that was about it.

When this tech was reimplemented and thrown over the wall to the community,
the cultural expectations that came with it didn't come along. So I've seen
situations like "whoops, we deleted our company's private key because it was
stored on local disk and then the Docker container was shut down, help!". This
was even reported as a bug in the software! No, the bug is that your computer
is randomly deleting critical files for no good reason, and normal software
does not expect that to happen. How about the way in a Dockerfile you have to
write 'apt-get update && apt-get upgrade' on one line if you want a secure OS
image when the container is rebuilt? If you put the two commands on separate
lines it will _appear_ to be working, right up until you start getting weird
errors about missing files from Debian mirrors, because Docker assumes every
single command you run is a pure function! And then we get to security.
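A sketch of that Dockerfile pitfall (base image and commands are illustrative):

```dockerfile
FROM ubuntu:20.04

# BAD (don't do this): each RUN is a separately cached layer, so Docker can
# reuse a stale "apt-get update" layer from months ago while re-running the
# commands below it against a package index the mirrors no longer serve.
#   RUN apt-get update
#   RUN apt-get upgrade -y

# GOOD: refresh the index and upgrade in one layer, so they are always
# rebuilt (and cached) together.
RUN apt-get update && apt-get upgrade -y && rm -rf /var/lib/apt/lists/*
```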

Screwups seem to follow Kubernetes/Docker around like flies. The tech is
complex and violates basic expectations programs have about how POSIX works.
When it goes wrong, it leads to mistakes in production.

Now there have been a few trends over time:

1. Hardware has got a lot more powerful. It has been outpacing growth in the
internet and economy. Many more businesses fit in a smaller number of machines
than when Borg was designed (an era where 4-core systems were considered high
end).

2. Cheap colo providers have been driving the cost of powerful VMs down to
the ground.

3. Linux has got easier to administer.

These days setting up a bunch of Linux VMs that self-upgrade, run some
services via systemd etc isn't difficult, and properly tuned such a setup
should be able to serve a monster amount of traffic (watch out though for
Azure, which seems to overcommit capacity pretty drastically and their VMs
have very unstable performance).

Most businesses that are deploying Kubernetes today quite simply do not need a
million machines. Even companies that give away complex services, like GitLab,
don't _really_ need huge scalability, as we can see from this thread. It's
just nice to imagine that the business will experience explosive growth and
that growth is now automated, but ironically, the effort to automate business
infrastructure scaling takes away from the sort of efforts that actually grow
the business.

As for the other benefits: you can write a systemd unit much simpler than a
Kubernetes configuration that will give you auto-restart (including when
memory limits are hit); you can view the activity of a cluster easily using
plain old SSH or something like Cockpit; it will handle log rotation for you
out of the box, sandboxing likewise, and so on. And of course the unattended-
upgrades package has existed for a while.
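As a sketch of that claim, a minimal systemd unit with auto-restart and a memory cap might look like this (the service and binary names are made up):

```ini
# /etc/systemd/system/myapp.service (hypothetical)
[Unit]
Description=My app

[Service]
ExecStart=/usr/local/bin/myapp
# Restart on crash or OOM kill, after a 5-second pause
Restart=always
RestartSec=5
# cgroup memory cap: a leaking service gets killed and then restarted
MemoryMax=512M

[Install]
WantedBy=multi-user.target
```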

I agree you need someone to do admin work, whatever path you choose. Having
used Kubernetes, and the system it is based on, and plain old Linux, my
intuition is that the base cost of Kubernetes is too high for almost all its
users. Too many ways to screw it up, too many ways for it to go wrong, too
much time spent screwing around, and too expensive. If you become another
Google or Facebook then sure, go for it. Otherwise, better avoided.

~~~
Keycap
I'm not saying it's not possible for small companies to run a handful of VMs
themselves.

But you quickly reach a size where a centrally managed Kubernetes cluster is
very versatile for your whole company.

1-2 people take care of the cluster, and the other teams then use that
Kubernetes cluster for builds, hosting of staging envs, etc.

And then you have a very small infra team which can provide those services to
others internally much more easily and safely.

------
solatic
> We found that with an application that increases its memory utilization over
> time, low requests (which reserves memory for each pod) and a generous hard
> limit on utilization was a recipe for node saturation and a high rate of
> evictions. To adjust for this we eventually decided to use higher requests
> and lower limit which took pressure off of the nodes and allowed pods to be
> recycled without putting too much pressure on the node.

Kubernetes works at its most efficient when you scale by adding more Pods,
rather than scaling within a Pod. Requests and limits should be equal to each
other, and a HorizontalPodAutoscaler (or KEDA ScaledObject) should be used to
add more Pods as utilization increases. Using Cluster Autoscaler then makes
sure that your cluster has sufficient Nodes to schedule the Pods. Then you
just need to set up alerting to tell you if the autoscaling reaches its
minimum (indicating that you should consider lowering the minimum, to improve
efficiency) or maximum (indicating that you should consider raising the
maximum), and you can leave things well enough alone, unless the speed at
which scaling occurs is too slow for your use case.
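A sketch of that pattern (names and thresholds are illustrative; the HPA API version depends on your cluster, e.g. autoscaling/v2beta2 on clusters of that era, autoscaling/v2 today):

```yaml
# Pod spec fragment: requests equal to limits (Guaranteed QoS)
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 500m
    memory: 512Mi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
```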

------
nuker
It reads like a press release, no gotchas, no complaints...

[https://k8s.af/](https://k8s.af/)

~~~
marinj
There are some complaints that we can share, but honestly they are not really
that fun to read. We are still early in our transition, and one major
complaint I have is about Helm's lack of flexibility, and about the operator
being very flexible but a huge time sink.

I think a lot of complaints we would usually have are reduced due to the fact
that we use managed K8s with GKE.

All other complaints I can think of are related to our application and some
architecture decisions we made early on, but that won't be interesting to
anyone but people at GitLab.

~~~
nuker
> complaints that we can share, but honestly they are not really that fun to
> read.

"You are mistaken counselor" (c) The Descendants

------
zelphirkalt
I have only very minimal experience with k8s, as I have tried to use it only
once (and never again). To me it seems like most of the comments are about the
unusable and unmaintainable YAML configuration files. This is exactly the
issue I had a few years ago, when I was cursing because I could not find a
spec or docs for the YAML stuff in all the pages and pages of seemingly
important documentation, and thus could not figure out how to transform a
Docker-only deployment into a k8s deployment. I also remember having to set up
helm, according to whatever tutorial or official guide I was following. It was
all a distant nightmare that I do not wish to repeat.

~~~
regularfry
There's definitely a need for an opinionated layer over the top, with much
simpler configuration. Probably more than one, with different use cases.
Nobody should have to wrangle that much yaml.

~~~
cpursley
[https://convox.com](https://convox.com)

It's basically heroku on top of k8s and you bring your own cloud.

------
ishcheklein
We are using k8s as well for a project that we also provide on-prem install
too. We decided to use kustomize for now vs doing helm for now. Curious, have
someone had experience with both to compare. What are the benefits of using
helm?

~~~
dpc_pw
`helm` is absolute garbage. I'll reserve judgment about `kustomize` until I
have more practical experience, but so far it looks to me like it's going to
be another YAML disaster.

The problem with the k8s ecosystem is that developers took k8s's use of YAML
as a semi-human-readable serialization of k8s resources as a reason to apply
YAML to absolutely everything.

What I'd like is to generate serialized yaml files ... you know ... with code.
Using some typed programming language, to be able to build any abstractions
required for the job, and get some checks, compiler errors, and even the
ability to have asserts, unit tests and so on. Instead, I have to dig YAML
with a pickaxe in the YAML mine without any technology to assist me.
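A minimal sketch of this idea (every name here is made up for illustration, not taken from any k8s library): typed Python objects that validate themselves and serialize to JSON, which any YAML parser also accepts.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Container:
    name: str
    image: str
    ports: List[int] = field(default_factory=list)

    def to_dict(self) -> dict:
        return {
            "name": self.name,
            "image": self.image,
            "ports": [{"containerPort": p} for p in self.ports],
        }


@dataclass
class Deployment:
    name: str
    replicas: int
    labels: Dict[str, str]
    containers: List[Container]

    def __post_init__(self) -> None:
        # The asserts the parent comment asks for: catch bad configs early,
        # at build time, not after `kubectl apply`.
        assert self.replicas >= 1, "replicas must be positive"
        assert self.containers, "at least one container required"

    def to_manifest(self) -> dict:
        return {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": self.name, "labels": self.labels},
            "spec": {
                "replicas": self.replicas,
                "selector": {"matchLabels": self.labels},
                "template": {
                    "metadata": {"labels": self.labels},
                    "spec": {"containers": [c.to_dict() for c in self.containers]},
                },
            },
        }


web = Deployment(
    name="web",
    replicas=3,
    labels={"app": "web"},
    containers=[Container(name="web", image="nginx:1.19", ports=[80])],
)
# JSON is a subset of YAML, so this output can be fed straight to kubectl.
print(json.dumps(web.to_manifest(), indent=2))
```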

~~~
creztoe
Could you explain why helm is garbage? I think it suits its purpose rather
well without being too complex. You can essentially "plug-in" different types
of resources rather easily. Especially in v3 now that you don't need to
install Tiller and can avoid setting those cluster permission requirements.

Have you tried some Kubernetes API libraries? You can generate and configure
resources with the [python kubernetes-client](https://github.com/kubernetes-client/python)
without much trouble. Personally I prefer editing them as JSON instead of
Python objects, but it isn't too bad.

~~~
Nullabillity
> Could you explain why helm is garbage?

Not the OP, but..

1. YAML string templating makes it very easy to get indentation and/or
quotation wrong, and the error messages can easily end up pretty far from the
actual errors. Structured data should be generated with structured templating.

2. "Values" aren't typechecked or cleaned.

3. Easy to end up in a state where a failed deploy leaves you with a mess to
clean up by hand.

4. No good way to preview what a deploy will change.

5. Weird interactions when resources are edited manually (especially back in
Helm 2, but still a thing).

6. No good way to migrate objects into a Helm chart without deleting and
recreating them.

7. Tons of repetitive boilerplate in each chart to customize basic settings
(like replica counts).

It's a typical Go solution, in all the wrong ways.

~~~
alephu5
It's not going to solve all your problems, but Dhall can fix your first few
gripes. I've been using it for several months and it's an excellent way to
write configuration imo.

~~~
Nullabillity
Yeah, I have used Nix to generate them in the past, which worked pretty great
too. But Helm does, admittedly, solve a real problem: garbage collecting old
resources when they're deleted from the repo. I just wish we could have
something much simpler that only did that...

~~~
mkingston
`kubectl apply --prune` should nominally do this. Irritatingly (I acknowledge
I'm almost as responsible as anyone else for doing something about this), it's
had this disclaimer on it for quite some time now:

> Alpha Disclaimer: the --prune functionality is not yet complete. Do not use
> unless you are aware of what the current state is. See
> [https://issues.k8s.io/34274](https://issues.k8s.io/34274).

I haven't used it in anger, so I can't add any disclaimer or otherwise of my
own.

 _kpt_ is the recent Google-ordained (AFAICT) solution to this problem, but is
not yet at v1.

You could also resolve this yourself by either:

* versioning with labels and deleting all resources with labels indicating older versions

* using helm _only_ in the last mile, for pruning

------
ChicagoDave
My question. Why not convert the rails code/APIs to blob/FaaS and skip all the
DevOps complexity?

Are containers providing enough long-term cost-effectiveness to a complete
cloud-native application architecture?

Or is it just mapping VMs to something “like” VMs so your topology remains
mostly the same?

~~~
auspex
Serverless has its own complexity. Deploying a smaller, lightweight purpose
built image, with no orchestration lock-in is the main benefit.

~~~
ChicagoDave
Those are fair infrastructure reasons, but not necessarily cost-effective.
There are always competing perspectives. What does the business want. What do
the developers want. What does the infrastructure team want. When power is
centralized in one area, decisions can be made that overlook those competing
purposes.

I see containers being adopted without openly discussing the options and long-
term consequences.

------
stepbeek
Was anyone else surprised by the timeline of this post? I was under the
impression that Gitlab had been fully on K8S for years now, given the hard
push of their K8S offering 18 months ago.

~~~
jarv
It has been possible to run GitLab the application on K8s for some time now,
for the majority of self-managed deployments. There were, however, some
limitations for running GitLab on K8s at the very large scale of GitLab.com,
which we have been working through over the last year.

(disclaimer: blog post author)

~~~
stepbeek
> August 2019 when we migrated the GitLab Container Registry to Kubernetes,
> the first service to move. Though this was a critical and high traffic
> service, it was a good choice for the first migration because it is a
> stateless application with only a few external dependencies

I was under the impression that porting a large stateless service was
something that wasn't too bad to do with K8S? What limitations prevented this
from going ahead earlier?

~~~
jarv
Yeah, this is why we decided to migrate it first about a year ago, and we
probably could have done it sooner than that. Other services were not ready to
migrate until recently due to some blocking issues, which was a factor in the
decision.

disclaimer: blog post author

------
Keycap
I'm not sure if I get it, but shouldn't internal traffic across AZs normally
be included?

It was also not clear to me after reading their issue page.

