
A Kubernetes/GKE mistake that cost me thousands of dollars - dankohn1
https://medium.com/@gajus/mistake-that-cost-thousands-kubernetes-gke-2212ea663e1f
======
shrumm
The problem with learning by doing is that it's extremely hard to find good
tutorials designed for production. Most of what I find these days is 'hello
world'-y, and then you need some tool like Sentry to catch the edge cases
that don't get caught in your limited testing.

I've 'rebuilt' our Kubernetes cluster almost 3 times since I started, by
applying lessons learned from running the last iteration for a few months.
It's just like anything else in software development: when you start, your
tech debt is high, mostly due to inexperience. Force yourself to reduce that
debt whenever you can.

As an example: the first version had a bunch of N1s (1 vCPU machines) with
hand-written YAML files and no autoscaling. I had to migrate our database and
had a headache updating the DB connection string on each deployment. Then I
discovered external services, which let me define the DB hostname once
(https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-mapping-external-services).
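
For anyone who hasn't seen it: the mechanism is a Service of type
ExternalName, which gives the external database a stable in-cluster DNS name.
A minimal sketch, with a placeholder hostname:

    apiVersion: v1
    kind: Service
    metadata:
      name: db                      # pods connect to "db" instead of a hardcoded host
    spec:
      type: ExternalName
      externalName: db.example.com  # placeholder for the real DB hostname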

It's all just to say that with Kubernetes, I think it's impossible to
approach it thinking you'll get it right the first time. Just dedicate more
time to monitoring at the beginning so you don't do anything 'too stupid',
and take the time to bake what you learn into your cluster.

~~~
sschueller
These getting-started tutorials are also very annoying for IT departments
that have developers pushing for containerization and not understanding that
it isn't just a few commands when you are responsible for operating them in
production. They aren't the ones who have to be on call in the middle of the
night and make sure it's all running.

~~~
dijit
Depends on your org. You should try applying SRE principles: developers get
the pager until their application meets defined criteria that you both agree
on, with management buy-in.

~~~
MoOmer
Sometimes, though, an infrastructure team will want to manage the service
internally, even if a managed service (e.g. GKE) is available and compliant
with the workloads, and would reduce overhead.

~~~
ianlevesque
Reduce overhead = eliminate their job

------
kingbirdy
It would be helpful if the author specified what the price difference
actually was at the end of the day - their initial cost was $3,500/mo, but
they don't mention how much they pay now, after changing instance types.

~~~
auslander
I wonder what the cost would be with no k8s setup and no containers at all,
just pure compute nodes. I'd bet my last socks it would be at least half the
price.

~~~
neurostimulant
With a $3500/mo budget you could get 15 AMD servers on Hetzner, each with 32
cores and 128 GB of RAM. You won't get autoscaling and the other nice stuff
available from the big cloud providers, but nothing beats dedicated servers
in terms of price/performance.

~~~
Wintereise
Commoditized cloud is basically paying a company to run your ops for you.

Let's not forget that you'd also have to pay at least 3 people (8 hours per
shift, not accounting for weekends) to ensure 24/7 service availability
across those 15 Hetzner nodes.

AWS/GCP/Azure et al make it so you generally don't need to worry about the
infrastructure part of the problem. Everything else still applies though.

~~~
neurostimulant
> Let's not forget that you'd also have to pay at least 3 people (8 hours per
> shift, not accounting for weekends) to ensure 24/7 service availability
> across those 15 Hetzner nodes.

This is true if you run your own datacenter, but dedicated server providers
typically monitor their servers and immediately intervene if they detect any
connectivity/network issues (which are outside our control even if we use
cloud providers like AWS).

For individual server failures (disk issues, etc.), handling the failure
condition on dedicated servers is not much different from handling it on
cloud servers (remove the bad server from the cluster, re-image a new server,
and join it to the cluster). The main difference is that provisioning a new
physical server is not instant, so you'll need to plan ahead and either have
some spares or slightly over-provision your cluster (so you can take down a
few nodes without degrading your service). You can do this automatically or
manually; it's not much different from using cloud servers.

Using dedicated servers is not as scary as people think it is.

------
le_didil
I would advise reading this section of the GKE docs. It explains the marginal
gains in allocatable memory and CPU from running larger nodes:
https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#memory_cpu

 _For memory resources, GKE reserves the following:_

* 255 MiB of memory for machines with less than 1 GB of memory

* 25% of the first 4 GB of memory

* 20% of the next 4 GB of memory (up to 8 GB)

* 10% of the next 8 GB of memory (up to 16 GB)

* 6% of the next 112 GB of memory (up to 128 GB)

* 2% of any memory above 128 GB

 _For CPU resources, GKE reserves the following:_

* 6% of the first core

* 1% of the next core (up to 2 cores)

* 0.5% of the next 2 cores (up to 4 cores)

* 0.25% of any cores above 4 cores
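
To make the marginal gains concrete: on a hypothetical 4-core node with 16
GB, those rules reserve 25% · 4 + 20% · 4 + 10% · 8 = 2.6 GB of memory (~16%)
and 60m + 10m + 2 · 5m = 80m of CPU (2%). On an n1-standard-1 (1 core, 3.75
GB) they reserve 25% · 3.75 ≈ 0.94 GB (~25%) of memory and 60m (6%) of CPU,
so small nodes lose a much larger fraction of their capacity before any of
your pods run.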

------
markbnj
I think the main issue this author ran into was caused by setting the CPU
requests so far below the limits. I get that he was trying to communicate the
spiky nature of the workload to the control plane, but I think it would have
been better to reserve CPU to cover the spikes by setting the requests higher,
especially on a 1 core node.

It's important to grok the underlying systems here, imo. CPU requests map to
the cpu.shares property of the container's cpu,cpuacct cgroup. The key thing
about cpu.shares is that it guarantees a minimum proportion of CPU time
(requests are converted at 1024 shares per core) but doesn't prevent a
process from consuming more if cycles are available. The CPU limit uses the
CFS bandwidth controller, which enforces a quota of CPU time per period
(default 100ms) that the process cannot, afaik, exceed. So by setting the
request to 20m and the limit to 200m the author left a lot of room for pods
that look like they fit fine under normal operating conditions to spike up
and consume all the CPU resources on the machine. K8s is supposed to reserve
resources for the kubelet and other components on a per-node basis, but I'm
not surprised it's possible to place these components under pressure with
settings like these on a 1-core node.
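
Concretely, with the numbers from the article, the spec in question amounts
to something like this sketch:

    resources:
      requests:
        cpu: 20m     # -> cpu.shares ≈ 20 (1024 per core); what the scheduler packs on
      limits:
        cpu: 200m    # -> CFS quota of 20ms per 100ms period; throttled only here

Ten such pods fit on a 1-core node by request (10 x 20m = 200m) yet can
legally demand two full cores at once.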

~~~
brown9-2
If I am understanding the article right, this would have led to fewer pods per
node and a higher overall cost, wouldn’t it have? The author claims that the
extra headroom allowed per pod for spiky CPU usage helped absorb those spikes.

~~~
markbnj
> If I am understanding the article right, this would have led to fewer pods
> per node and a higher overall cost, wouldn’t it have?

To be clear I don't know exactly how the k8s scheduler weighs CPU requests vs.
limits in fitting pods to nodes. It's something I've wanted to dig further
into. I do know basically how the two underlying control systems function. The
cpushares system (K8S CPU requests) cannot prevent a process from taking more
shares. The CPU bandwidth control system (K8S CPU limits) will throttle a
process at the upper limit, but processes will not be evicted by the kubelet
for hitting this limit. So if you have a pod with requests set to 20m and
limits set to 200m, those pods are able to take 200m. If the scheduler is
using limits to fit pods then maybe you can get 3-4 of these on a 1 core node
and leave enough CPU for the other system components. If it's using some
weighted combination of limits and requests then it might place more than 3-4
pods on that node, each of which is then permitted to take up to 200m. Just a
theory, and I am sure there are some folks here who know exactly how it works.
Maybe we'll hear from someone.

~~~
brown9-2
The scheduler uses requests alone:
https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-requests-are-scheduled

------
Thaxll
It's missing something crucial: with 100 nodes instead of 1, you have the
overhead of running Kubernetes on all 100 of them, which is actually high
(kubelet etc.). On one big node you have roughly 1 core used by Kubernetes,
and the rest is available for your app.

~~~
llarsson
Agreed. Although the kubelet itself is not too terrible, all the other stuff
you need to run alongside it (the "etc." part of your post) is what costs
you: the network provider, per-host monitoring, and reserved resources per
host, just to name a few.

Constantly adding and removing hosts can also negatively affect e.g. the
network provider, depending on which one you use. In my experience, Weave
worked significantly worse than something "simpler" like flannel when
combined with frequent autoscaling.

~~~
drenvuk
So let's say you were building a bandwidth-heavy service: on the best
providers, each 1 GB / 1 vCPU node is limited to around a 1 Gbps port. If the
goal is to maximize total data transferred, my thought would be that it's
better to have as many nodes as possible. Sending data at the rate cap
shouldn't be that big a hit on the CPU with big enough chunks, even
considering TLS costs. But I'm not sure about this. It always seems that
people are trying to maximize their compute capabilities and not their
throughput when they talk about Kubernetes, but I've never really had that
focus.

What would you do if you were serving tons of data but didn't have to compute
much?

------
rcarmo
I've seen this kind of thing happen a number of times, and it's good to remind
ourselves that oversubscribing resources is still a good way to tackle the
"padding" related to scaling.

I have been playing with an autoscaling k3s cluster
([https://github.com/rcarmo/azure-k3s-cluster](https://github.com/rcarmo/azure-k3s-cluster))
in order to figure out the right way to scale up compute nodes depending on
pod requirements. Even though the Python autoscaler I am noodling with is
just a toy, I'm coming to the conclusion that all the work involved in using
the Kubernetes APIs to inspect pod requirements and decide whether to spawn a
new VM is barely more efficient than just using 25% "padding" on CPU metrics
to trigger autoscaling with standard Azure automation, at least for batch
workloads (I run Blender jobs on my toy cluster to keep things simple).

YMMV, but it's fun to reminisce that oversubscription was _the_ way we dealt
with running multiple services on VMware stacks, since it was very rare for
everything to need all the RAM or IO at once.

------
jarfil
That's less of a K8s issue and more of a general multiprocessing issue. Would
you rather have:

* 96x single-core CPUs with no multithreading

* 1x 96-core CPU with multithreading, but running all cores at full power all the time

* 1x 96-core CPU that can turn off sets of 16 cores at a time when they're not in use.

~~~
malkia
It depends. What if that beefy machine dies, needs a reboot, or the kernel
can't scale up to serve all those containers?

But mostly, if it dies...

------
epiphone
Here's a whole collection of Kubernetes bloopers:
https://github.com/hjacobs/kubernetes-failure-stories. I for one am glad
people are sharing!

------
rainyMammoth
I read the whole thing and couldn't tell what the "mistake that cost
thousands" actually was.

~~~
notyourday
In a cloud, X * Y != 1x (X * Y)

Actually, X * Y is _massively higher_ than 1x (X * Y)

------
asdfasgasdgasdg
This is all very interesting, but one thing that occurs to me is: why are
there so many idle pods? Is there any way to fold the work that is currently
being done in multiple different pods into one pod? Perhaps via multiple
threads, or even just plain single-threaded concurrency? Unless there is some
special tenancy requirement, that might be the most efficient way to deal with
this situation.

~~~
zbentley
The author alluded to this by referencing task queues. It's not uncommon to
have task queue workers listening on "bursty" queues which are usually quiet,
with the constraint that when work arrives in the queue it should be picked up
_really quickly_ (I.e. no waiting for program, pod/container, or hardware
startup). If you have multiple task queues and sets of workers for different
kinds of work (I don't know if the author does, but it is definitely a common
pattern), then you can easily end up with a decent number of idle pods sitting
around.

This isn't unique to k8s either; all sorts of queue-worker-oriented
deployments have this issue.

The wasted idle capacity can be mitigated by having separate "burst" capacity
(idle workers sitting around waiting for work) and "post-burst" capacity (a
bunch of new workers that get created in response to a detected backlog on a
queue), but orchestrating that is complicated: how much of a backlog merits
the need for post-burst workers to begin starting? Instead, can the normal
burst workers pay it down fast enough that no new instances/pods need to be
scheduled? Do your post-burst workers always start up at the same rate (hint:
as your application/dependency stack grows, they start slower and slower)? How
do you define SLOs/SLAs for a service with a two-tiered scale-out behavior
like this (some folks are content with just a max time-to-processed SLA based
on how long it takes a post-burst worker to come online and consume the oldest
message on the queue at a given point in time, other workloads have more
demanding requirements for the worst/average cases)?

In many cases, just keeping the peak-scale amount of idle workers sitting
around waiting for queue bursts is cheaper (from an operational/engineering
time perspective) than building something that satisfactorily answers those
questions.

------
tdurden
It is difficult to determine what exactly the 'mistake' was in this post.

~~~
edaemon
They were using high-resource nodes and made the mistake of shifting to low-
resource nodes for fine-grained control over the total amount of resources
provisioned -- i.e. a single 96-unit node vs 96 1-unit nodes. That was a
problem because the low-resource nodes were much less efficient at processing
the actual load: a portion of each node's resources was allocated to k8s
system processes, and any idle pod consumed a higher proportion of the node's
resources to do nothing. As a result the autoscaling functions
provisioned even more resources than they were using with the high-resource
nodes. The solution was to use medium-resource nodes that offered somewhat
granular control but made efficient use of the available resources.

------
jackcodes
One thing that I’ve been unable to wrap my head around is how to effectively
calculate the right CPU share values for single-threaded web servers.

I’ve got a project using this setup, but it’s a fairly common one - e.g.
Express with Node clustering, Puma on Rails, etc. On Kubernetes you obviously
just forgo the clustering ability and let k8s handle the concurrency and
routing for you.

So in this instance, I’m struggling to see why I wouldn’t request a value of
1 vCPU for each process. My thinking is that my program is already
single-threaded, and asking the Kubernetes CPU scheduler to spread resources
between multiple single-threaded processes is pure overhead. At that point I
should allow each process to run to its full capacity. Is that correct?

This, I feel, gets a lot more complex now that my chosen language, DB
drivers, and web framework are just starting to support multithreading.
That’s a can of worms I can’t begin to figure out how to allocate for - 2
vCPUs each? Does anyone know?

~~~
chmod775
JavaScript, the language, is single threaded, but node and V8 are not.

Depending on how much (and how heavy) async work you're doing it might be
reasonable to let a node process use multiple cores.

~~~
jackcodes
Good point - although it’s worth clarifying that I’m on Crystal for this,
meaning I’m definitely single-threaded in this instance. Would it be as
simple as 1 vCPU per pod?

~~~
YawningAngel
At $job we default to 1vCPU per pod with the option to ask for more if that
makes sense (e.g. you use a lot of shared heap and can meaningfully
multithread).
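
i.e. per container, something like the sketch below; whether to also cap it
with a CPU limit is debated further down the thread:

    resources:
      requests:
        cpu: "1"    # one full core per single-threaded worker process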

~~~
jackcodes
Thanks, this was almost exactly the brief validation I was looking for.

------
specialp
K8s nodes are going to work much better with more CPUs (to a certain extent).
As the post said, when you have idle and bursting pods, you need headroom. If
you have single-CPU nodes, your bursting pods are going to oversubscribe the
node more often, because the pool is smaller. If you have 3 pods per CPU and
on average 1 of the 3 bursts during any time period, there's a chance that 2
or all 3 go at once and cause pods to be evicted and moved. But if they were
on 16-CPU nodes, it averages out more. Also, single-CPU nodes still need to
run their network layer, kubelet, etc.
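
Rough numbers for that intuition, treating bursts as independent with
probability 1/3 per pod: on a 1-CPU node with 3 pods, P(2 or more bursting at
once) = 3(1/3)^2(2/3) + (1/3)^3 = 7/27 ≈ 26%. Pool 48 such pods on a 16-CPU
node and the number bursting averages 16 with a standard deviation of
sqrt(48 · 1/3 · 2/3) ≈ 3.3, so the relative swing drops from ~82% of the mean
to ~20% - that's the "averages out" effect.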

As for the 96-CPU instance, that really isn't good either, unless your pods
were all taking 1+ CPUs each. Even then, I'd rather run 6 x 16 CPUs. There's
a pod limit of ~110 per node, not to mention the loss of redundancy. I find
16-32 CPU nodes the best balance.

~~~
markbnj
> As for the 96-CPU instance, that really isn't good either, unless your pods
> were all taking 1+ CPUs each. Even then, I'd rather run 6 x 16 CPUs.
> There's a pod limit of ~110 per node, not to mention the loss of
> redundancy. I find 16-32 CPU nodes the best balance.

Agreed, this is the other side of the node sizing question. At the low end you
have to consider what your most resource hungry workloads need (and we use
nodepools to partition our particularly edge-casey things), and then at the
upper end you don't want the failure of one node to take out half your stuff.

------
lazyant
This is more of a procedural mistake than a specific technical
(Kubernetes/GKE) mistake, even if the tech stack is the root cause.

This is a capacity planning or "right-sizing" problem. In prod you don't just
go and completely flip your layout (100 1-vCPU servers vs 1 100-vCPU server
or whatever), and even more so in a stack you are not yet expert in; you
change a bit and then measure. Actually, you try to figure this out first in
a dev environment if possible.

------
jayd16
I'm really struggling with connecting his conclusion to what we know of his
workload. Can someone spell it out for me?

He has many idle pods with a bursty workload.

The author says they need to reserve a lot of CPU or containers fail to
create. Why is this? Wouldn't memory be a more likely cause of the failure?
How does a lack of CPU cause a failure?

Later the author notes that a many-core machine is good for his workload
because "pods can be more tightly packed." How does that follow? A pod using
more than its reserved resources will bump up against the other pods on that
physical machine whether you've virtualized it as a standard-1 or a
standard-16. Is there a cost saving because the unreserved RAM is
over-provisioned? Wouldn't that overbooking be dangerous if you had uniform
load across all the pods in a standard-16?

Said another way, why is resource contention with yourself in a standard-16
better or cheaper than contention with others in the standard-1 pool?

My understanding was that choosing among the vCPU options was simply a
trade-off between pricing granularity and the per-node CPU overhead of k8s.

~~~
dilyevsky
It’s a bit hard to unpack as there seem to be multiple unrelated things there
but i think the gist is classic problem of smaller nodes causing more resource
fragmentation. The solution to increase node size is also classic and easy
enough in cloud environment but has its tradeoffs as well (like lower
reliability)

------
cpitman
Choosing the right size for nodes comes up often enough that I blogged some
rough guidelines last year:
http://cpitman.github.io/openshift/2018/03/21/sizing-openshift-and-kubernetes-nodes.html

------
abiro
This article makes such a strong case for ditching k8s for serverless:

* needs granular scaling

* devops expertise is not core to the business

* save developer time

~~~
k__
I thought the same. Even more so after reading the comments here.

"K8s is so hard! All the tutorials are too basic! I had to redo my cluster
multiple times!"

------
pythonwutang
> Therefore, the best thing to do is to deploy your program without resource
> limits, observe how your program behaves during idle/ regular and peak
> loads, and set requested/ limit resources based on the observed values.

This is one of the author’s fatal assumptions. The best practice as I
understand it is to set CPU requests to around 80% of peak, and limits to
120% of peak, _before_ deploying to prod.
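
So if load testing showed a peak of, say, 500m per pod (a made-up number),
that rule of thumb would give:

    resources:
      requests:
        cpu: 400m    # ~80% of the observed 500m peak
      limits:
        cpu: 600m    # ~120% of peak: headroom, but no unbounded runaway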

They set themselves up for disaster with this architecture where they have
many idle pods polling for resource availability. This resource monitoring
should have been delegated to a single pod.

Also it’s really unclear what specific strategy led to extra costs of 1000s of
dollars...

------
dgoog
what is the psychosis in the k8s community where they feel the need to talk
about losing thousands of dollars? it's a recurring theme with this community
that they think somehow makes them look cool - wouldn't that be a clear sign
that they should not be using k8s to begin with?

this community is ripe for implosion - what a joke

~~~
ben_jones
Truly learning Kubernetes requires running it as a production system, which
runs the risk of accidentally incurring costs.

~~~
dgoog
learning to develop a good drug habit runs the risk of incurring large costs
as well - what's your point?

------
olalonde
Anyone know what the "CPU(cores)" value means exactly (e.g. 83m)? What's that
m unit?

~~~
speedplane
> Anyone know what the "CPU(cores)" value means exactly (e.g. 83m)? What's
> that m unit?

Not sure of the unit, but it generally means 83% of a single CPU's processing
power. My understanding is that this is not strictly enforced, it's just a
tool for Kubernetes to schedule pods and to make sure no set of pods add up to
more than 100m on any CPU.

~~~
olalonde
I actually just found the relevant Kubernetes documentation [0]. It stands for
"millicores", so 83m would be 8.3% of a single CPU.

[0] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
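
(For reference, 1000m is one full core, and the fractional notation is
interchangeable: cpu: 250m and cpu: 0.25 mean the same thing.)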

------
dilyevsky
Repeat this mantra every morning:

Never set CPU limits; always set memory request = limit, unless you _really_
have a good reason not to.
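
In manifest form (the numbers are placeholders), the mantra reads:

    resources:
      requests:
        cpu: 100m        # CPU request only: guaranteed share, free to burst
        memory: 512Mi
      limits:
        memory: 512Mi    # memory limit == request: memory is incompressible,
                         # so never promise less than the pod may actually use
        # no cpu limit: avoids needless CFS throttling latency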

~~~
rossmohax
CPU limits are essential for keeping runaway pods from taking over the whole
node. Except in pathological cases, it works very well with a lowered
cfs_period_us.

~~~
dilyevsky
This is wrong - the CPU request already constrains them using cgroups CPU
shares. All a CPU limit does is fuck up your latency and waste resources on
underutilized nodes.

Edit: also, please advise how I can tune the CFS period on GKE.

~~~
rossmohax
Requests do not constrain anything; they more or less specify the proportion
of CPU time allocated when contention occurs. A 50m-request pod can be cut
off from CPU completely by a runaway 500m pod, to the point that it starts
timing out and failing readiness probes. Or worse, these effects start to be
seen on the kubelet, which almost nobody runs with a low enough GOMAXPROCS.
Setting limits keeps the node healthy.

~~~
dilyevsky
It will constrain it to its proportional share, as you noted, but only during
contention (which is a feature, not a bug). Thus the 50m pod will get 1/10th
the cycles of the 500m pod, which is WAI. On a fully subscribed node, the
former pod will get exactly a 50/(1000 * number of allocatable cores) share,
so I don’t see how that would cause issues, provided the pod can actually
survive on such a small slice in the first place.
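
(For example, on a 4-core node that works out to 50/4000 = 1.25% of node
cycles - exactly the 0.05 cores the pod requested. Contention never pushes a
pod below its request.)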

The kubelet has its own cgroup in the hierarchy, above the pods, and should
set its CPU shares there as well (most cloud providers already do this).
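
The knobs for that are kube-reserved and system-reserved in the kubelet
configuration (on self-managed nodes; GKE sets these for you, per the
reservation numbers quoted upthread). A sketch with placeholder values:

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    kubeReserved:            # carve-out for kubelet + container runtime
      cpu: 100m
      memory: 256Mi
    systemReserved:          # carve-out for OS daemons (sshd, journald, ...)
      cpu: 100m
      memory: 256Mi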

------
egdod
> Pardon the interruption. We see you’ve read on Medium before - there’s a
> personalized experience waiting just a few clicks away. Ready to make Medium
> yours?

Why in the world do I need an account to read a glorified blog? It’s text
data, I should be able to consume it with curl if I’m so inclined.

~~~
lscotte
Medium is terrible. I don't quite understand why people allow a company to
monetize their blog for Medium's gain. But that's their choice, not mine.

~~~
derwiki
Traffic

~~~
arthurcolle
Content discovery via Medium itself is extremely poor, so this isn't really a
complete answer on its face.

~~~
rocqua
The ability to handle traffic is what was meant.

~~~
cameronbrown
I don't think it was. Handling blog traffic is not difficult.

------
test100
Sorry to hear this.

------
rvz
> As a disclaimer, I will add that this is a really stupid mistake and shows
> my lack of experience managing auto-scaling deployments.

There is a reason these DevOps certifications exist in the first place, and
why it is a huge risk for a company to spend lots of time and money on
training to learn a tool as complex as Kubernetes (unless they are preparing
for a certification). Perhaps it would be better to hire a consultant skilled
in the field rather than using it blind and making these mistakes later.

When mistakes like this occur and go unnoticed for a long time, they rack up
and create unnecessary costs of as much as $10k/month, which, depending on
the company budget, can be very expensive and can make or break a company.

Unless you know what you are doing, don't touch tools you don't understand.

~~~
notyourday
> There is a reason these DevOps certifications exist in the first place, and
> why it is a huge risk for a company to spend lots of time and money on
> training to learn a tool as complex as Kubernetes (unless they are
> preparing for a certification). Perhaps it would be better to hire a
> consultant skilled in the field rather than using it blind and making these
> mistakes later.

That's laughable but I will play:

I will pay anyone with a devops cert $0.01 for the right to 10% of my savings
over a one-year period. If I end up paying more for the service after hiring
such a person, that person will pay me 110% of the excess I paid for the
service as a result of hiring them. If a devops cert were actually any good,
this would be a license to print money for anyone holding one.

OP's problem is that his organization did not engage in any sort of risk
management, which is why they had:

a) K8s treated as something magical that makes things work

b) Someone who did not know how K8s works being allowed to re-engineer K8s

c) No alert on changes in the usage data exported by Google

P.S. If you are on a cloud, drop everything and implement (c). It will save
your shirt dozens if not hundreds of times a year.

~~~
fwip
Nah, you've probably just got a fragile configuration that won't scale and
will cost you money in downtime, lost sales, or failure to live up to
contract.

An engineer who does it right isn't going to save you much money over your
best case scenario - but they're going to keep you from losing millions in the
worst case scenarios.

~~~
notyourday
> Nah, you've probably just got a fragile configuration that won't scale and
> will cost you money in downtime, lost sales, or failure to live up to
> contract.

Those are the tales told by consultants and engineers who like to play with
toys: it is a typical case of premature optimization. The odds that you'll
have enough traffic to need to scale are slim to none.

If you do need to scale, the odds are your apps are over-engineered on corner
cases and under-engineered on the main path: if your ORM takes 300 ms to
initialize on every request without fetching any data from the database,
"scaling" is the last thing you should be worried about.

> An engineer who does it right isn't going to save you much money over your
> best case scenario - but they're going to keep you from losing millions in
> the worst case scenarios.

You will go out of business before those savings are going to matter.

